
JOINT RESOURCE MANAGEMENT AND TASK SCHEDULING FOR MOBILE EDGE COMPUTING

A Dissertation
Submitted to
the Temple University Graduate Board

in Partial Fulfillment
of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY

by
Xinliang Wei
May, 2023

Examining Committee Members:

Dr. Yu Wang, Advisor, Dept. of Computer & Information Sciences


Dr. Yan Wang, Dept. of Computer & Information Sciences
Dr. Hongchang Gao, Dept. of Computer & Information Sciences
Dr. Zhu Han, External Member, University of Houston, Texas
©
by
Xinliang Wei
May, 2023
All Rights Reserved

ABSTRACT

In recent years, edge computing has become an increasingly popular computing
paradigm to enable real-time data processing and mobile intelligence. Edge
computing allows computing at the edge of the network, where data is generated,
and distributes data to nearby edge servers to reduce data access latency and
improve data processing efficiency. In addition, with the advance of the
Artificial Intelligence of Things (AIoT), not only are massive amounts of data
generated by everyday smart devices, such as smart light bulbs, smart cameras,
and various sensors, but a large number of parameters of complex machine
learning models also have to be trained and exchanged by these AIoT devices.
Classical cloud-based platforms have difficulty communicating and processing
these data/models effectively with sufficient privacy and security protection.
Due to the heterogeneity of edge elements, including edge servers, mobile
users, data resources, and computing tasks, the key challenge is how to
effectively manage resources (e.g., data, services) and schedule tasks (e.g.,
ML/FL tasks) in the edge clouds to meet the QoS requirements of mobile users or
maximize the platform's utility. To that end, this dissertation studies joint
resource management and task scheduling for mobile edge computing.
The key contributions of the dissertation are two-fold. First, we study the
data placement problem in edge computing and propose a popularity-based method
as well as several load-balancing strategies to effectively place data in the
edge network. We further investigate a joint resource placement and task
dispatching problem and formulate it as an optimization problem. We propose a
two-stage optimization method and a reinforcement learning (RL) method to
maximize the total utility of all tasks. Second, we focus on a specific
computing task, i.e., federated learning (FL), and study the joint participant
selection and learning scheduling problem for multi-model federated edge
learning. We formulate a joint optimization problem and propose several
multi-stage optimization algorithms to solve it. To further improve FL
performance, we leverage the power of quantum computing (QC) and propose a
hybrid quantum-classical Benders' decomposition (HQCBD) algorithm, as well as a
multiple-cuts version that accelerates the convergence of HQCBD. We show that
the proposed algorithms achieve the same optimal value as the classical
Benders' decomposition running on a classical CPU, but with fewer convergence
iterations. We also demonstrate that the hybrid quantum-classical Benders'
decomposition technique has the potential to be applied to larger-scale
scenarios in the near future.
Keywords: Resource Management, Edge Computing, Federated Learning, Multi-stage
Optimization, Hybrid Quantum-Classical Technique

ACKNOWLEDGEMENTS

One of the most meaningful experiences of my life has been and will always be
pursuing my doctorate at Temple University. The obstacles and challenges I faced
over the past four and a half years and eventually overcame have been very beneficial
to me. I have developed the skills necessary to be a qualified researcher throughout
this process, as well as a rigorous attitude toward research. I want to sincerely thank
everyone who has supported and helped me.
First of all, I would like to express my deepest gratitude to my Ph.D. advisor,
Prof. Yu Wang. During the past four years, he has been so patient in guiding me
in my research and has taught me invaluable lessons in both doing research and
handling problems in life. Without his innumerable and continuous support, I
would not have been able to accomplish as much as I have. Prof. Wang has
devoted his time and effort to advising my research, discussing popular
research topics, and sharing his insightful ideas with me. He always tried his
best to help and support me by introducing collaborators in specific areas to
me so as to motivate me with new insights. He also gave me great help and
suggestions during my job interviews. I am so fortunate to have had Prof. Yu
Wang as my advisor while pursuing my Ph.D. degree.
Through Prof. Yu Wang's introduction, I had the chance to collaborate with
Prof. Zhu Han and Prof. Lei Fan at the University of Houston, as well as Prof.
Yuanxiong Guo at the University of Texas at San Antonio. All of them helped me
a lot in my recent research on the hybrid quantum-classical technique. I would
like to thank Prof. Zhu Han, who guided me in exploring hybrid
quantum-classical solutions to the joint optimization problems in federated
learning and integrated space-air-ground networks. At the same time, I
appreciate Prof. Lei Fan, who helped me understand and improve the formulation
of the optimization problem from a mathematical perspective. Moreover, the
discussions with Prof. Yuanxiong Guo always inspired me to understand the
nature of a problem and spot potential issues.
My thanks also extend to my Ph.D. committee members, Prof. Yan Wang and Prof.
Hongchang Gao, for their time and valuable comments on this dissertation. Their
suggestions have improved the quality of this dissertation to a large extent. I
would like to thank all my current and former colleagues in the department as
well as visiting scholars. I appreciate their friendship and the happy time we
spent together at Temple University.
Last but not least, I want to thank my girlfriend, Dr. Siyun Chen, for her
emotional and spiritual support. Her love and encouragement are the strongest
driving force behind the completion of my doctoral program. I would also like
to express my deepest gratitude to my beloved parents for their understanding
of my studying abroad.

TABLE OF CONTENTS

Page

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

CHAPTER

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Major Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 POPULARITY-BASED DATA PLACEMENT . . . . . . . . . . . . . . . 4

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Data Popularity and Design Overview . . . . . . . . . . . . . . . . . 6

2.2.1 Network Models and Data Placement Problem . . . . . . . . . 6

2.2.2 Data Popularity . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.3 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 Virtual Coordinate Construction . . . . . . . . . . . . . . . . . . . . 11

2.3.1 Calculating Coordinates of Data Items based on Data Popularity 11

2.3.2 Calculating Coordinates of Edge Servers based on Network Distance . . . 13

2.4 Data Placement and Retrieve . . . . . . . . . . . . . . . . . . . . . . 15

2.4.1 Placing Data to Edge Servers . . . . . . . . . . . . . . . . . . 15

2.4.2 Retrieving Data from Edge Servers . . . . . . . . . . . . . . . 16

2.5 Data Placement with Limited Storage . . . . . . . . . . . . . . . . . . 18

2.5.1 Processing Order for Data Placement . . . . . . . . . . . . . . 18

2.5.2 Offloading Choice . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5.3 Data Retrieve . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6 Data Placement with Multiple Replicas . . . . . . . . . . . . . . . . . 22

2.6.1 Number of Replicas . . . . . . . . . . . . . . . . . . . . . . . . 23

2.6.2 Placing Replicas . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.7.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.7.2 Comparison with Existing Methods . . . . . . . . . . . . . . . 27

2.7.3 Global Retrieve vs Local Retrieve . . . . . . . . . . . . . . . . 28

2.7.4 Placement Strategies with Storage Limits . . . . . . . . . . . . 30

2.7.5 Placement Strategies with Data Replicas . . . . . . . . . . . . 31

2.8 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.9 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 JOINT RESOURCE PLACEMENT AND TASK DISPATCHING . . . . . 36

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2 System Models and The Optimization . . . . . . . . . . . . . . . . . 38

3.2.1 Network and System Models . . . . . . . . . . . . . . . . . . . 38

3.2.2 Resource Placement . . . . . . . . . . . . . . . . . . . . . . . 39

3.2.3 Task Dispatching . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2.4 Joint Optimization Problem . . . . . . . . . . . . . . . . . . . 43

3.3 Two-Stage Optimization Method . . . . . . . . . . . . . . . . . . . . 43

3.3.1 Two-Stage Optimization . . . . . . . . . . . . . . . . . . . . . 43

3.3.2 Joint Optimization across Two Timescales . . . . . . . . . . . 45

3.4 Reinforcement Learning based Method . . . . . . . . . . . . . . . . . 47

3.4.1 RL Framework: State, Action, and Reward . . . . . . . . . . . 48

3.4.2 DDPG RL Algorithm . . . . . . . . . . . . . . . . . . . . . . . 49

3.4.3 RL Method across Two Timescales . . . . . . . . . . . . . . . 51

3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.5.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.5.2 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . 54

3.5.3 Running Time and Convergence of OPT . . . . . . . . . . . . 55

3.5.4 OPT across Two Timescales with Dynamic Status . . . . . . . 56

3.5.5 Performance and Convergence of RL . . . . . . . . . . . . . . 58

3.6 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.6.1 Resource Placement/Management . . . . . . . . . . . . . . . . 59

3.6.2 Task Offloading/Dispatching . . . . . . . . . . . . . . . . . . . 61

3.6.3 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . 62

3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4 JOINT PARTICIPANT SELECTION AND SCHEDULING IN FL . . . . 64

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.2.1 Edge Cloud Model . . . . . . . . . . . . . . . . . . . . . . . . 66

4.2.2 Federated Learning over Edge . . . . . . . . . . . . . . . . . . 66

4.3 Joint Participant Selection and Learning Optimization Problem . . . 70

4.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 70

4.3.2 Cost Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.4 Our Proposed Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.4.1 Three-Stage Methods . . . . . . . . . . . . . . . . . . . . . . . 73

4.4.2 Two-Stage Methods . . . . . . . . . . . . . . . . . . . . . . . . 76

4.4.3 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.5.1 Environment Setup . . . . . . . . . . . . . . . . . . . . . . . . 78

4.5.2 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . 80

4.6 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5 QUANTUM-ASSISTED SCHEDULING ALGORITHMS . . . . . . . . . . 90

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.2 System Model and Problem Formulation . . . . . . . . . . . . . . . . 92

5.2.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.2.2 Federated Learning Model . . . . . . . . . . . . . . . . . . . . 93

5.2.3 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.2.4 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 96

5.3 Hybrid Quantum Assisted Benders’ Decomposition (HQCBD) Methods 97

5.3.1 Benders’ Decomposition . . . . . . . . . . . . . . . . . . . . . 97

5.3.2 Classical Optimization for Subproblem . . . . . . . . . . . . . 99

5.3.3 Quantum Formulation for Master Problem . . . . . . . . . . . 100

5.3.4 HQCBD Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 102

5.3.5 Multiple Cuts Version . . . . . . . . . . . . . . . . . . . . . . 103

5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.4.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.4.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . 106

5.5 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.5.1 Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . 111

5.5.2 Client Selection and Learning Scheduling . . . . . . . . . . . . 112

5.5.3 Hybrid Quantum Optimization . . . . . . . . . . . . . . . . . 113

5.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6 DISSERTATION CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . 115

6.1 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

LIST OF FIGURES

2.1 A typical edge computing environment. . . . . . . . . . . . . . . . . . 5

2.2 Examples of two placement strategies of two data at two servers. . . . 8

2.3 Virtual-space-based approach. . . . . . . . . . . . . . . . . . . . . . . 9

2.4 Framework of the proposed data placement solution in a software-defined edge network infrastructure. . . . . . 10

2.5 An example of a physical network topology and the relation of shortest path length. . . . . . 12

2.6 Illustration of the virtual coordinates of two data items in the virtual plane. . . . . . 13

2.7 Illustration of basic data placement. . . . . . . . . . . . . . . . . . . . 16

2.8 Illustration of global retrieve vs local retrieve. . . . . . . . . . . . . . 17

2.9 Illustration of two methods for Find Offloading Server. . . . . . . 21

2.10 Illustration of the virtual coordinates of data replicas generated by our method. . . . . . 24

2.11 Comparison with existing data placement methods. . . . . . . . . . . 28

2.12 Comparison of global retrieve and local retrieve in OUR-B. . . . . . . 29

2.13 Comparison of global retrieve and local retrieve in OUR-B and OUR-S. 30

2.14 Distribution of placed data items among servers for OUR-B and four OUR-S strategies. . . . . . 31

2.15 Loads among servers with multiple data replicas. . . . . . . . . . . . 32

2.16 Comparison of placement strategies with multiple replicas. . . . . . . 32

3.1 A typical edge cloud environment. . . . . . . . . . . . . . . . . . . . . 37

3.2 Illustration of joint resource placement and task dispatching across two timescales. . . . . . 46

3.3 The architecture of Actor-Critic RL framework. . . . . . . . . . . . . 48

3.4 The architecture of DDPG RL Algorithm. The circled numbers are the corresponding steps. . . . . . 50

3.5 Resource placement and task dispatching via deep reinforcement learning across two timescales with two DDPG models. . . . . . 52

3.6 Overall performance of four methods in one timescale. . . . . . . . . . 54

3.7 Running time and convergence of OPT. . . . . . . . . . . . . . . . . . 55

3.8 Performance of OPT across two timescales with dynamic status. . . . 56

3.9 Convergence of RL under different timescales. . . . . . . . . . . . . . 58

3.10 Convergence of RL under different batch size and learning rate. . . . 59

4.1 Example of multi-model FL over the edge. . . . . . . . . . . . . . . . 65

4.2 The training process of an FL model within the edge network at different time periods. . . . . . 67

4.3 The problem decomposition and design of our proposed multi-stage algorithms. . . . . . 72

4.4 An example of edge cloud topology. . . . . . . . . . . . . . . . . . . . 78

4.5 Performance comparison with different metrics. . . . . . . . . . . . . 81

4.6 Impact of the number of FL models on costs. . . . . . . . . . . . . . . 82

4.7 Impact of the number of FL workers on costs. . . . . . . . . . . . . . 83

4.8 Comparison of two different processing orders of FL models. . . . . . 84

4.9 Training loss with LR models/tasks and the impact of FL workers. . . 85

4.10 Training accuracy with three FL tasks and the impact of FL workers. 86

5.1 The training process of distributed federated learning. . . . . . . . . . 91

5.2 The proposed HQCBD framework. . . . . . . . . . . . . . . . . . . . 97

5.3 Flow of HQCBD with a single cut and multi cuts. . . . . . . . . . . . 102

5.4 Performance of HQCBD: its convergence. . . . . . . . . . . . . . . . . 107

5.5 Comparison of the real solver accessing time and gains of MBD over CBD in different cases. . . . . . 108

5.6 Performance of MBD: its convergence. . . . . . . . . . . . . . . . . . 109

5.7 Performance comparison with existing methods. . . . . . . . . . . . . 110

LIST OF TABLES

3.1 RL Hyper Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.1 Parameters Setting for Edge Cloud and FL . . . . . . . . . . . . . . . 79

5.1 Iteration comparison of CBD and HQCBD over three different cases. 107

5.2 Solver accessing time (ms) comparison of CBD and HQCBD. . . . . . 108

5.3 Iteration of CBD and MBD with different σ. . . . . . . . . . . . . . . 109

CHAPTER 1

INTRODUCTION

1.1 Background
Recently, there has been tremendous growth in mobile edge computing in both
academia and industry due to its advantages over traditional cloud computing
(e.g., low latency, agility, and privacy). Especially with the increasing
amount of data and services offered by diverse applications and IoT/smart
devices, network operators and service providers are likely to build and deploy
computing resources (such as data, models, and services) at the edge of the
network near users to shorten response times and support real-time intelligent
applications.
A typical edge computing environment consists of mobile users, edge clouds
(including multiple edge servers connected by the edge network), and a remote
cloud (usually within data centers). Each edge server is generally deployed at
the network edge near mobile users and owns specific storage, CPU, and memory
capacities. Mobile users can generate computation tasks at any location, which
need to be dispatched to edge servers with sufficient resources (i.e., internal
computation resources such as CPU, memory, and storage) and may also require
certain data or services (i.e., external resources such as training data or
machine learning services). Note that the types of computing tasks from mobile
users/devices are heterogeneous due to diverse settings and applications. For
example, some tasks may only request data (e.g., images or videos) or a machine
learning (ML) model from the edge network, and then process it locally or
perform ML computation based on the model at the local edge server. Some tasks
may request computation at other edge servers with certain computation
services, such as video analysis, speech recognition, and 3D rendering. Some
tasks may need a combination of data, services, and computation resources, such
as distributed federated learning or interactive augmented reality.
Despite the benefits brought by edge computing, the heterogeneity of edge
elements, including edge servers, mobile users, data resources, and computing
tasks, makes it challenging to effectively manage resources (e.g., data,
services) and schedule tasks (e.g., ML/FL tasks) in the edge clouds so as to
meet the QoS requirements of mobile users or maximize the platform's utility.
The goal of this dissertation is to build practical solutions for joint
resource management and task scheduling in mobile edge computing. We aim to
address the following questions:

1. How and where to place resources in the edge network to minimize the total
accessing cost of all mobile users?

2. Where to dispatch the computing tasks to maximize the total utility of
performed tasks?

3. In the specific case of federated learning tasks, how to optimally select
participants, and determine the local learning rate as well as the learning
topology for multi-model federated learning scenarios to minimize the total
learning cost of all FL models?

4. How to handle the edge network dynamics across different timescales or the
large-scale edge network scenario?

1.2 Major Contributions


The major contributions of this dissertation can be summarized as follows:

• First, we propose a popularity-based data placement strategy to minimize the
average accessing cost of data items. Our proposed method maps more popular
data closer to the network center so that it is placed on a nearby server,
which shortens the shortest paths during the data retrieval process and reduces
the overall response latency. We also propose several offloading and
replication strategies that make smart offloading and replication decisions
based on data popularity, to further reduce the pressure on overloaded servers
and improve the overall performance.

• Second, we investigate a joint optimization problem considering the storage,
CPU, and memory constraints as well as the edge server status for resource
placement and task dispatching in mobile edge computing. We then propose two
alternative approaches, two-stage optimization and deep reinforcement learning,
to solve the joint optimization problem. Both methods can be applied across two
timescales to deal with the different dynamics of tasks and resources in mobile
edge computing.

• Next, we formulate a new joint participant selection and learning scheduling
problem of multi-model federated edge learning as a mixed-integer programming
problem, with the goal of minimizing the total FEL cost while satisfying
various constraints. We decouple the original optimization problem into two or
three sub-problems and then propose three algorithms to effectively find
participants and learning rates for each FEL model by iteratively solving the
sub-problems.

• Last but not least, we propose a novel Hybrid Quantum-Classical Benders'
Decomposition (HQCBD) algorithm to tackle the joint participant selection and
learning scheduling problem. By combining quantum computing and classical
optimization techniques, our HQCBD algorithm converges to the same desired
solution as the classical BD algorithm, but with far fewer iterations and
faster speed. We further present a multiple-cuts version of HQCBD (MBD), which
accelerates convergence by taking multiple outputs from the quantum annealer to
form multiple cuts in each round. By selecting various numbers of cuts, MBD can
achieve varying levels of performance improvement.

1.3 Dissertation Overview


In this dissertation, we first briefly introduce the background, motivation,
and research contributions. In Chapter 2, we present the popularity-based data
placement strategy for mobile edge computing. In Chapter 3, we investigate the
joint resource placement and task dispatching problem in mobile edge computing.
In Chapter 4, we explore the joint participant selection and learning
scheduling problem of multi-model federated edge learning. In Chapter 5, we
further propose a novel quantum-assisted scheduling algorithm to solve this
joint problem. The dissertation conclusion and future work are discussed in
Chapter 6.

CHAPTER 2

POPULARITY-BASED DATA PLACEMENT

2.1 Introduction
Edge computing has grown in popularity as a computing paradigm for enabling
real-time data processing and mobile intelligence in recent years. Edge computing
refers to computing at the network’s edge, where data is generated and distributed
at nearby edge servers to reduce data access latency and improve data processing
efficiency. One of the key challenges in data-intensive edge computing is determining
how to effectively place data at edge clouds so that access latency to the data is
minimized.
As shown in Fig. 2.1, a typical edge computing environment consists of several
entities: mobile users, edge servers, the edge network, and a remote cloud.
Unlike the cloud environment, edge servers are geographically dispersed at the
edge of the network near the mobile users and own heterogeneous computing and
storage capabilities [95, 31, 110, 10, 65, 72, 128]. Each edge server can
provide services for the mobile users in a specific nearby area by holding some
data/models and performing computation tasks based on those data/models.
Hereafter, we use data to refer to both data and models, as long as they are
required for performing the service requested by mobile users1. When a mobile
user requests data, its request is forwarded to the nearest edge server. If the
edge server has the data, it can respond to the mobile user immediately with
the data (as Data C in Fig. 2.1) or perform the corresponding computing service
for the user. Otherwise, the edge server has to retrieve the data from other
edge servers (Data A or B) or even from the remote cloud (Data F). Data
placement is a critical issue in edge computing, since the location of data
affects the response latency of the requested service. If the data is stored at
a nearby edge server, the service can be performed very quickly, while a
request that needs to access a remote cloud takes much longer.
1 Here, we do not differentiate between personal data and public data, as long
as the data/model will be used/shared by multiple users at different locations.
Also, different security and privacy protection techniques [135, 57, 50] can be
applied before the data placement.

Figure 2.1: A typical edge computing environment.

In addition, as shown in Fig. 2.1, multiple mobile users at different locations
may request the same data (Data B), and different data items have diverse
popularity (i.e., different numbers of requests from users). Therefore, in this
chapter, we study the data placement problem in edge computing with the
consideration of data popularity.
Data placement has been well studied in distributed systems [17, 1, 13, 59,
103, 100, 88, 19, 4, 49, 125, 26, 109, 141, 107, 133]. However, edge computing
has its own characteristics [95], such as proximity, fluctuation, and
heterogeneity. Edge servers deployed in the edge network are in the proximity
of mobile users compared with traditional distributed systems (e.g., cloud
computing), which speeds up data processing as a direct result of lower
latency. In addition, devices are usually user-controlled and can leave the
edge network at any time, which means the network status fluctuates over time.
Furthermore, the topology of edge environments is heterogeneous and dynamic,
which brings another challenge to data placement, e.g., how to maintain the
data already stored in the edge servers when the topology changes. Thus, the
data placement problem in edge computing has also drawn significant attention
from researchers recently [93, 55, 45, 12]. However, most existing works
formulate the data placement problem as an optimization problem and leverage
complex optimization solvers to tackle it. Such methods suffer from high
computation and communication overheads, which makes them unsuitable for
large-scale systems. Most recently, Xie et al. [123, 122] proposed a novel
virtual-space-based method, which maps both switches and data indexes/items
into a virtual space and places data based on virtual distance in the space.
Their method enables efficient retrieval via greedy forwarding. However, none
of these works consider data popularity when placing data on edge servers.
when placing data on edge servers.
In this section, we investigate a static data placement strategy based on data
popularity in edge computing to reduce the average forwarding path length of
data. Inspired by [123, 122], we also adopt a virtual-space-based placement
method with greedy-routing-based retrieve, but take data popularity into
consideration when we generate the coordinates of data items. Based on the
observation that, in a dense network, a node in the central region has smaller
shortest paths to other areas than nodes in the surrounding regions, we
carefully design our mapping strategy so that a popular data item is placed
closer to the network center in the virtual plane. The placement of data is
then purely based on the distance between the data item and the edge server in
the virtual plane. To address the storage limits at servers and balance the
load among edge servers, we further propose several placement strategies, which
either offload data items to other servers when the assigned server is
overloaded or place multiple replicas of the same data item to reduce the
assigned load of servers. In both cases, we take data popularity into
consideration when designing the offloading and replication strategies.
Simulation results show that our proposed strategies achieve better performance
than existing solutions [123, 122]. Moreover, both the offloading and
replication strategies can effectively relieve the storage pressure of
overloaded edge servers.

2.2 Data Popularity and Design Overview

2.2.1 Network Models and Data Placement Problem


Generally, we consider a typical edge computing environment as shown in
Fig. 2.1, where an edge network G(V, E) connects N edge servers with M links.
Here V = {v1, v2, ..., vN} and E = {e1, e2, ..., eM} denote the set of edge
servers and the set of direct links among them, respectively. Each edge server
vi has a specific maximal storage capacity ci = c(vi). Let lij = l(vi, vj)
represent the shortest path length from edge server vi to vj in G; we then have
a distance matrix L = {lij} which holds the lengths of all shortest paths in
the edge network. Assume that we have W data items, D = {d1, d2, ..., dW}, in
the system. Each data item di has a specific data size si = s(di) and data
popularity pi = p(di) (which will be explained in the next subsection). For
each data item di, we need to find an edge server vj to hold it. The data
placement problem can then be represented as finding a mapping f from D to V,
where f(di) = vj. The goal of the data placement problem is to find a mapping
that minimizes the average access cost (or delay) to stored data items in the
edge network G and also balances the load among edge servers. Xie et al. [122]
proposed a nice virtual-space-based data placement strategy for edge computing;
however, they did not consider data popularity among data items. Compared with
a complex optimization-based data placement strategy, the virtual-space-based
method is much simpler and easier to implement.
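To make the objective concrete, the following minimal Python sketch evaluates
the average access cost of a given placement; the names average_access_cost,
placement, and requests are illustrative and not part of the chapter's formal
notation.

    def average_access_cost(L, placement, requests):
        # L[i][j]: shortest path length l_ij between edge servers i and j.
        # placement[d]: index of the server f(d) that holds data item d.
        # requests: list of (source_server, data_item) access requests.
        costs = [L[src][placement[d]] for src, d in requests]
        return sum(costs) / len(costs)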

2.2.2 Data Popularity


Data popularity measures how much a given piece of data is requested by the
users in a system, indicating the importance of that data. Therefore, it is one
of the most important parameters in the design of various data-centric
distributed systems and enables more intelligent data management, such as file
assignment in parallel I/O systems [44], replication management in distributed
storage systems [115, 62], load balancing in content delivery networks [32],
and coordinated caching in named data networking [47]. Hamdeni et al. [28]
provided a nice survey on data popularity and highlighted its importance in
replication management in distributed systems. Taking popularity into account
allows us to better place data or their replicas to avoid overloaded sites in
any distributed system.
For the data placement problem in edge computing, data popularity is also
critical. First, data placement aims not only to minimize the data access delay
but also to balance the load among edge servers; thus data items have to be
placed on different servers. Obviously, placing more popular data items at edge
servers with shorter delays within the network can significantly reduce the
data access cost during data retrievals, since popular data are repeatedly
requested by various users from all edge servers.

For example, Fig. 2.2 shows an example of two placement strategies, which place
two data items d1 and d2 at servers v1 and v2, respectively, but differently.
Assume that p(d2) > p(d1), so there are more requests from other servers to
d2's location than to d1's. With the different placements, the routes of the
shortest paths are different (as the blue and red trees marked in the figure),
and v2 has shorter paths to all other servers than v1 does. The table in the
figure shows the total length of all shortest paths to each data item under the
two placement strategies:

              d1    d2
        f1    95    81
        f2    81    95

Placement f1 has better performance, since the red shortest path tree has a
smaller path length (81) than the blue one (95) in Fig. 2.2(a), while the
situation is reversed in Fig. 2.2(b). Therefore, in this chapter, we introduce
data popularity to assist the data placement strategy in edge computing.

Figure 2.2: Examples of two placement strategies of two data at two servers.

Although data popularity has been widely used in distributed systems, to the
best of our knowledge, most of the existing data placement strategies for edge
computing do not consider it. The only exception is [45], where the authors
considered data popularity as part of their estimation of the value of a data
block in their formulated placement problem. Particularly, they compute the
data popularity based on the access frequencies of the data blocks and the time
interval between two accesses, and use it as one of the parameters in their
utility function. The data placement problem is then formulated as a complex
combinatorial optimization problem solved by a tabu search algorithm. Different
from their solution, we use data popularity in the virtual space mapping, where
data items are mapped to a virtual space based on their popularity, and the
placement decision is then based on the mapped coordinates.

[Figure 2.3 diagram: the network plane (edge servers) and the data plane (data
items with their popularities) are both mapped into a common virtual space.]

Figure 2.3: Virtual-space-based approach.

Data popularity can be assessed differently in a distributed system depending
on the application. In general, three factors contribute to data popularity:
the number of accesses (i.e., how many times the data item is requested), the
lifetime, and the request distribution over time or space. In this
dissertation, we simply use the number of accesses as the data popularity.
However, it is not difficult to extend our definition to include the other two
factors (or even other data popularity measurements) in our system. For each
data item di, we assume that its data popularity pi = p(di) describes its
number of access requests over time and that the data popularity of each data
item is known to the system. Larger data popularity means the data item is more
frequently accessed by mobile users in the system. Obviously, the locations of
the popular data items are at the root of the overall data placement problem,
compared with those of unpopular data. Note that there could also be more
complex data popularity models, where various user- or location-specific
preferences may be considered differently even for the same data item. Our
proposed method can be further extended to deal with such models by treating
the data preferences from different users/locations with different weights or
using more refined models. We leave such a study as future work.
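As a minimal illustration of this definition, the snippet below counts accesses
in a hypothetical request log to obtain p(di) and pmax; the log contents are
invented for the example.

    from collections import Counter

    # Hypothetical request log: each entry is the ID of a requested data item.
    request_log = ["d1", "d2", "d2", "d7", "d2", "d1"]
    popularity = Counter(request_log)   # p(d_i) = number of access requests
    p_max = max(popularity.values())    # p_max, used later in Eq. (2.1)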

[Figure 2.4 diagram: the control plane determines the coordinates of data items
and edge servers from the measured network topology, produces the data
placement decision and the data retrieval strategy, and inserts forwarding
rules into the switch plane, where switches determine the coordinate of each
data request and forward it accordingly.]

Figure 2.4: Framework of the proposed data placement solution in a
software-defined edge network infrastructure.

2.2.3 Design Overview


Similar to [123, 122], our popularity-based data placement strategy adopts a
virtual-space approach, which maintains a virtual 2D plane and maps all edge
servers and data items to such a plane, as shown in Fig. 2.3. The data
placement is based on the associated coordinates of edge servers and data items
in the virtual plane. How to perform the mappings is critical in our design.
When mapping edge servers into the virtual plane, we try to make sure the
Euclidean distance between two servers in the plane is proportional to their
network distance. When mapping the data items into the plane, we try to spread
them out while taking their data popularity into consideration, such that more
popular data is closer to the center. The intuition behind this design is that,
in the center area, the shortest paths to all other servers are shorter. Given
the constructed coordinates, we can make our placement decision based on the
virtual distance between a data item and an edge server. The simplest version
is to place the data item on the nearest server in the virtual plane.
Fig. 2.4 illustrates a framework of the proposed methods under a
software-defined edge network infrastructure. In such a system, the switches
provide data communication services to edge servers by following the forwarding
rules/entries placed by the controller in the control plane. At the control
plane, virtual coordinates are constructed for both data items and edge
servers, and then a data placement strategy and its corresponding retrieval
strategy generate the forwarding rules for switches. At the switch plane, a
switch first maps the data request to the virtual plane and forwards it based
on its virtual coordinate and the installed forwarding rules. To handle load
balancing among edge servers and further reduce the data access delay, we
propose several additional placement strategies, which offload data items to
nearby servers when the edge server chosen by our placement method exceeds its
maximal storage limit, as well as new replica placement strategies, which
strategically place multiple replicas to serve users while using data
popularity to decide the number of replicas, in favor of more popular data.
In summary, data popularity plays a central role in our design. It is
considered during the construction of the virtual coordinates of data items,
the selection of offloading choices, and the decision on the number of replicas
to deploy.

2.3 Virtual Coordinate Construction


In this section, we discuss the construction of coordinates for both data items
and edge servers in the virtual plane, which is a circular region with a radius
of 1. The edge servers and data items will be mapped to this virtual space, and
their coordinates will be unified to [−1, 1]. The center of the circular region
(i.e., o with coordinates (0, 0), as in Fig. 2.6) represents the center of the
network.

2.3.1 Calculating Coordinates of Data Items based on Data Popularity
Recall that each data item di has data size s(di) and data popularity p(di).
Assume that each data item also has a unique identifier (index or ID) ID(di).
We compute the virtual coordinate of a data item by leveraging the polar
coordinate system and a hash function. In the polar coordinate system, each
point is determined by a distance (r) from a reference point and an angle (θ)
from a reference direction. In terms of the hash function, given a specific key
value of arbitrary size (in our case the data item's ID), a hash function
returns a fixed-size hash value, which we use to generate the polar coordinate
of this data item in the virtual space.
Our proposed method to calculate the virtual coordinates of data has three goals.
First, the mapping should be able to spread all data over the virtual plane where the
edge servers will also sit. This can balance the load of data hosting among servers.

[bar chart: average length of shortest paths (y-axis) vs. distance to the
network center (x-axis)]
(a) network topology (b) shortest path length
Figure 2.5: An example of a physical network topology and the relation of
shortest path length.

Second, the mapping method needs to take data popularity into consideration and
place the popular data items at locations which have smaller shortest paths to
other regions. Last, the mapping method should be deterministic, i.e., given
the same data item, the output of our mapping method should be the same. This
guarantees that, for requests on the same data item, our retrieval process
leads to the same location in the virtual plane.
In our solution, we map each data item di to a virtual location in the virtual
plane whose polar coordinates are r(di) and θ(di). To consider data popularity
in the mapping, our design is based on the following observation: in a dense
network, the center area has smaller shortest paths to all regions. Fig. 2.5(b)
shows the average length of the shortest paths to other servers versus the
distance to the network center for each server in a randomly deployed network
with 50 servers (Fig. 2.5(a)). The servers closer to the center of the network
have a smaller total length of all shortest paths. Based on this observation,
our mapping method puts a popular data item near the center of the virtual
plane. Specifically, we generate r(di) ∈ [0, 1] using

r(di ) = 1 − p(di )/pmax , (2.1)

where pmax is the maximal data popularity among all data items. By doing so,
the more popular a data item is, the closer it lies to the center point, as
shown in Fig. 2.6(a). To spread data items over different regions, we calculate
the angle θ(di) using the hash value of the data's ID. Particularly, we first
calculate the hash value H(di) by using a hash function H (e.g., SHA-256).
[(a) polar coordinates (r(di), θ(di)); (b) Cartesian coordinates (x(di), y(di))]
Figure 2.6: Illustration of the virtual coordinates of two data items in the
virtual plane.

Next, we reduce the hash value to the scope of the virtual space by (1) using
only the last 4 bytes of H(di) and converting them to a 4-byte binary value
h(di), and (2) normalizing h(di) between 0 and 2π. In other words,

θ(di) = 2π × h(di)/(2^32 − 1). (2.2)

By doing so, we place this data item along a certain direction in polar
coordinates. Different data items will be spread along different directions.
Even data items with the same data popularity will be placed at different
locations. The final polar coordinates are (r(di), θ(di)), whose corresponding
Cartesian coordinates can be obtained by

x(di) = r(di) × cos θ(di),  y(di) = r(di) × sin θ(di). (2.3)

Fig. 2.6 illustrates the relationship between the polar and Cartesian virtual
coordinates. All data items are mapped into a circular region with a unit
radius in the virtual plane. The construction of coordinates for all data items
can be done in O(W), where W is the number of data items.
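The following Python sketch puts Eqs. (2.1)-(2.3) together; it assumes SHA-256
as the hash function H (as suggested above) and treats the last 4 bytes of the
digest as an unsigned integer.

    import hashlib
    import math

    def data_coordinates(data_id, popularity, p_max):
        # Radial distance: more popular data lies closer to the center (Eq. 2.1).
        r = 1.0 - popularity / p_max
        # Angle: hash the data ID, keep the last 4 bytes, map to [0, 2*pi] (Eq. 2.2).
        digest = hashlib.sha256(data_id.encode()).digest()
        h = int.from_bytes(digest[-4:], "big")
        theta = 2.0 * math.pi * h / (2**32 - 1)
        # Polar to Cartesian (Eq. 2.3).
        return r * math.cos(theta), r * math.sin(theta)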

2.3.2 Calculating Coordinates of Edge Servers based on Network Distance
We also want to spread all edge servers over the same virtual plane. The major
goal of the mapping of edge servers is to ensure that the Euclidean distance
between two edge servers in the virtual plane is proportional to their physical
network distance. By doing so, when we place popular data items near the center
of the virtual plane, their access cost will be relatively small, since the
cost is proportional to the distance in the virtual plane. In addition, this
ensures that the local retrieve (which picks the next server based on virtual
coordinates) has a low routing stretch. This mapping problem is a network
embedding (or graph embedding) problem, which has been well studied. Given the
network topology G and the shortest path measurements among edge servers, we
adopt the M-position algorithm used by [123, 122] to generate the virtual
coordinates of edge servers in the 2D virtual plane.
For completeness, we briefly review the basic idea of this algorithm. Given the
network topology G, we can obtain the shortest path matrix L = {lij}, where lij
is the shortest path length from edge server i to j. Using L as the input, the
M-position algorithm calculates the coordinates of the edge servers,
represented as a coordinate matrix Q (a 2 × N matrix for N edge servers in the
two-dimensional virtual plane), i.e.,

        | x(v1)  x(v2)  · · ·  x(vN) |
    Q = |                            |
        | y(v1)  y(v2)  · · ·  y(vN) |

The key idea behind the M-position algorithm is based on the fact that Q can be
derived from a scalar product matrix B = −(1/2) J L(2) J via eigenvalue
decomposition [123]. The major steps of the mapping algorithm to generate the
coordinates of edge servers are given as follows.

1. Given the network topology G, generate the shortest path matrix L = {lij}
and compute L(2) = {lij^2}, which is the squared distance matrix.

2. Compute the scalar product matrix B = −(1/2) J L(2) J, where J = I − (1/N) A,
I is the N × N identity matrix, and A is an N × N matrix of ones.

3. Determine the two largest eigenvalues λ1, λ2 and the corresponding
eigenvectors ξ1, ξ2 of matrix B.

4. Construct the coordinates of the edge servers as Q = Ξ2 Λ2^(1/2), where Λ2
is the diagonal matrix of the two eigenvalues and Ξ2 is the matrix of the two
eigenvectors of B.

5. Normalize Q to (1/(√2 qmax)) Q, where qmax is the largest absolute value of
all elements in Q, so that all coordinates of edge servers are within the
circular region with unit radius in the virtual plane.

The construction of coordinates for all edge servers takes O(N^3), which is
dominated by the complexity of computing all-pairs shortest paths and the
eigendecomposition of the matrix.
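A compact NumPy sketch of these five steps is given below; it follows the
classical multidimensional scaling convention B = −(1/2) J L(2) J and clips
tiny negative eigenvalues that may arise numerically.

    import numpy as np

    def embed_servers(L):
        # L: N x N matrix of shortest path lengths l_ij (step 1 input).
        N = L.shape[0]
        J = np.eye(N) - np.ones((N, N)) / N          # J = I - (1/N) A
        B = -0.5 * J @ (L ** 2) @ J                  # scalar product matrix (step 2)
        vals, vecs = np.linalg.eigh(B)               # eigendecomposition (step 3)
        idx = np.argsort(vals)[-2:][::-1]            # two largest eigenvalues
        lam = np.clip(vals[idx], 0.0, None)
        Q = (vecs[:, idx] * np.sqrt(lam)).T          # 2 x N coordinates (step 4)
        return Q / (np.sqrt(2.0) * np.abs(Q).max())  # normalize (step 5)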

2.4 Data Placement and Retrieve


In this section, we discuss how our data placement strategy places and retrieves
data based on virtual coordinates.

2.4.1 Placing Data to Edge Servers


Since we have obtained the coordinates of both edge servers and data items in
the same region of the virtual plane, the data placement becomes quite
straightforward. Here, we first assume that each edge server has sufficient
storage to hold the data placed by our placement strategy. We will discuss how
to handle load balancing when there is a storage limit at the edge server
later. Our proposed basic data placement strategy (denoted by Basic Placement)
places each data item di at the edge server that has the nearest Euclidean
distance to this data item in the virtual plane, i.e.,

f(di) = arg min_vk ||vk, di|| = arg min_vk √((x(vk) − x(di))² + (y(vk) − y(di))²).

Fig. 2.7 shows an example of the data placement output. The assignment of
servers forms a Voronoi diagram (Fig. 2.7(a)) in the region: if a data item
falls within a Voronoi cell, it is placed at the edge server that owns that
cell. Since more popular data items are closer to the center of the network (as
shown in Fig. 2.7(b)), they are placed on edge servers whose shortest paths to
other servers are shorter. Note that the center of the network here refers to
the center of the circular region in the virtual plane; the popular data items
are placed on servers near the center of the network, as determined by the
distances in this virtual plane. The data placement decision is made in O(W N)
time, where N and W are the numbers of edge servers and data items,
respectively.

(a) Voronoi diagram of all edge servers in the virtual plane; (b) a zoom-in
view of data placement in Voronoi cells

Figure 2.7: Illustration of basic data placement.
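The placement rule itself reduces to a nearest-neighbor query in the virtual
plane, as in this minimal sketch (coordinates are assumed to be stored as NumPy
arrays; the function name is illustrative):

    import numpy as np

    def basic_placement(server_xy, data_xy):
        # server_xy: N x 2 server coordinates; data_xy: W x 2 data coordinates.
        diff = data_xy[:, None, :] - server_xy[None, :, :]
        dist = np.linalg.norm(diff, axis=2)   # W x N Euclidean distances
        return dist.argmin(axis=1)            # index of f(d_i) for each item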

2.4.2 Retrieving Data from Edge Servers


With our proposed coordinate construction and data placement, there are two
possible strategies to retrieve data from the edge network: global retrieve and
local retrieve.
Global Retrieve: The global retrieve method assumes that the edge servers and
the network controller have global knowledge of the topology. Based on this
knowledge, the shortest paths between any pair of servers are calculated, and
the corresponding switching information is deployed in the underlying switches.
When a mobile user at an edge unit requests a data item di, it first determines
the coordinate (x(di), y(di)) of this data item in the virtual plane based on
its index and popularity. Then, based on our placement strategy, it can
calculate the target server which holds the data item, i.e., vj = f(di). Next,
the data request is routed towards vj by following the stored shortest path
until it reaches the target server. The global retrieve strategy not only
guarantees the delivery of data requests but also has minimal retrieval
latency, since it relays the request over the shortest path. However, the
drawback is the additional storage overhead at switches and edge servers to
store the shortest paths to all edge servers.
Local Retrieve: In contrast, the local retrieve method assumes that each edge
server only knows the information of its neighboring edge servers. Thus it
enjoys lower storage overhead at servers and switches. However, the challenge
is how to route the request to the target server.


Figure 2.8: Illustration of global retrieve vs local retrieve.

Note that instead of finding the target server f(di) directly (which requires
knowledge of the coordinates of all servers), in local retrieve the data
request is routed towards the coordinate (x(di), y(di)) of the data item in the
virtual plane. When a server receives the request, it first checks whether the
data item is stored locally. If the current edge server is the target server,
it replies to the request. Otherwise, the current server greedily selects the
next server to forward to from among its neighboring servers based on their
coordinates. The criterion is to pick the neighbor whose coordinate is nearest
to the target coordinate (x(di), y(di)) in the virtual plane. This procedure
repeats until the request reaches the target server. Notice that greedy
forwarding may fail at a local minimum, but randomized forwarding can be used
to get out of the local minimum2. However, compared with the global retrieve,
the local retrieve incurs longer retrieval latency due to (1) a longer
exploration process to find the target server and (2) a longer discovered
delivery path between the source and target servers.
Fig. 2.8 illustrates the difference between these two retrieve methods. There
is a trade-off between performance (retrieval latency) and complexity
(computing, storing, and updating the global shortest paths). Our proposed data
placement method supports both retrieval methods and can select the appropriate
one for different application scenarios.
2 There are other methods to eliminate the local minimum for greedy routing,
via adjusting the transmission range [112] or building a Delaunay graph [111,
42].
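One forwarding step of the local retrieve can be sketched as follows, assuming
each server knows only its neighbors' virtual coordinates; returning None
signals a local minimum where the randomized fallback would take over.

    import math

    def greedy_next_hop(current, neighbors, coords, target_xy):
        # neighbors[v]: list of servers adjacent to v; coords[v]: (x, y) of v.
        def dist(v):
            x, y = coords[v]
            return math.hypot(x - target_xy[0], y - target_xy[1])
        best = min(neighbors[current], key=dist, default=None)
        if best is None or dist(best) >= dist(current):
            return None   # local minimum: fall back to randomized forwarding
        return best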

2.5 Data Placement with Limited Storage
We introduced the basic data placement strategy based on data popularity in the
previous section. However, we have not yet considered the storage capacity of
the edge servers: under the basic placement strategy, some edge servers may be
overloaded with data items, since each edge server vi has a specific maximal
storage capacity ci = c(vi) and can only store data items whose total size is
up to ci. Hereafter, we use cc(vi) to denote the current storage usage of
server vi. In this section, we propose some simple heuristics to handle the
load balancing3 needed due to storage limits. All the heuristics use the output
(i.e., f(di)) of our basic data placement as their input, but they differ from
each other in (1) the ordering of data placement and (2) the choice of
offloading server. Here, offloading is just an additional step while making the
data placement decisions; the real placement of data items on servers happens
after the final data placement decision. In addition, due to the offloading,
data retrieval also needs to be able to find the offloading server.
2.5.1 Processing Order for Data Placement


After the basic data placement based on data popularity, we have an initial placement decision, which can be denoted as a list place_dec consisting of multiple two-tuples (di, vj = f1(di)), where di is the data index and vj is the edge server assigned to di by the basic data placement. However, some of the edge servers may not have enough storage to hold all assigned data items; therefore, we need to make an offloading decision to reassign some of their data items. Deciding which item from which server to offload is tricky, since moving the assignment of a data item to another server may cause a chain effect on other servers. Such an effect is also determined by the processing order of data placement with offloading. Here, we propose two different methods to process the data placement with offloading consideration.
Placing Data in the Order of Popularity. The first method processes the data items based on their popularity to confirm each data item placement or offload it to another server. It first sorts the resulting list place_dec in descending order based on data popularity p(di). Next, for each item (di, vj = f1(di)) in place_dec, if the
³ Later, we also consider using multiple replicas to further balance loads among servers. In addition, there are other possible load-balancing strategies for edge computing [136, 7, 127, 25].

Algorithm 1 Data Placement in Order of Data Popularity
Input: The placement decision place_dec = {(di, f1(di))} determined by Basic_Placement.
Output: The updated new placement decision place_dec.
1: Sort place_dec in descending order of popularity p(di);
2: for each item = (di, vj) in place_dec do
3:   if s(di) + cc(vj) > c(vj) then
4:     vl = Find_Offloading_Server(di, vj);
5:     Update and confirm item = (di, vl), i.e., f2(di) = vl;
6:     cc(vl) += s(di);
7:   else
8:     Confirm item = (di, vj), i.e., f2(di) = vj;
9:     cc(vj) += s(di);
10: return place_dec = {(di, f2(di))}

summation of the data size s(di) of di and the current storage usage cc(vj) of vj does not exceed the maximal server storage c(vj), then we confirm this placement and place data di on edge server vj. Otherwise, we find an available edge server vl to offload this data item (denoted by a procedure Find_Offloading_Server) and modify the initial placement decision to f2(di) = vl. Multiple ways to find such an edge server to offload will be discussed in the next subsection. The detailed algorithm is presented in Algorithm 1. By performing this algorithm, we can guarantee that each server has sufficient storage to hold all assigned data items and avoid overloading certain edge servers. The total time complexity of Algorithm 1 is O(W log W + W X), where O(W log W) is from ordering the data popularity and O(W X) is for W rounds of finding an offloading server for each data item. Here, X is the cost of Find_Offloading_Server, which is bounded by the number of neighbors of the overloaded server vj or by N, depending on which method is used.
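As a concrete illustration, the following Python sketch mirrors Algorithm 1 under some assumed data structures: place_dec is a list of (data, server) pairs from the basic placement; s, p, c, and cc are dictionaries of sizes, popularities, capacities, and current usage; and find_offloading_server stands in for either offloading method of Section 2.5.2. None of these names comes from the original implementation.

def place_in_popularity_order(place_dec, s, p, c, cc, find_offloading_server):
    """Sketch of Algorithm 1: confirm or offload placements in descending
    order of data popularity, so popular items get first pick of storage."""
    final_placement = {}
    for d, v in sorted(place_dec, key=lambda pair: p[pair[0]], reverse=True):
        if s[d] + cc[v] > c[v]:
            # The assigned server is full: offload d to another server.
            v = find_offloading_server(d, v)
        final_placement[d] = v      # confirm f2(d) = v
        cc[v] += s[d]
    return final_placement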
Placing Data in the Order of Server Capacity. The second method processes the data placement in a different order, based on the maximal edge server storage capacity. The idea of this strategy is to deal with the edge servers that have larger maximal storage capacities first. When determining which data should be placed on the current edge server and which should be offloaded, we continue to
Algorithm 2 Data Placement in Order of Server Capacity
Input: The placement decision place_dec = {(di, f1(di))} determined by Basic_Placement.
Output: The updated new placement decision place_dec.
1: Sort V in descending order of server capacity c(vi);
2: for each vi in V do
3:   Generate Di based on place_dec;
4:   Sort Di in descending order of data popularity p(dk);
5:   for each dk in Di do
6:     if s(dk) + cc(vi) > c(vi) then
7:       vl = Find_Offloading_Server(dk, vi);
8:       Update/confirm dk's placement, i.e., f3(dk) = vl;
9:       cc(vl) += s(dk);
10:    else
11:      Confirm dk's placement, i.e., f3(dk) = vi;
12:      cc(vi) += s(dk);
13: return place_dec = {(dk, f3(dk))}

take into account data popularity, where more popular data is more likely to stay at the current server. The algorithm acts as follows. First, we sort the list of edge servers V in descending order according to the maximal edge server storage capacity c(vi), such that c(v1) ≥ c(v2) ≥ · · · ≥ c(vN). For each server vi, we define Di as a list that consists of all data items assigned to vi by the basic data placement, i.e., Di = {dk | vi = f1(dk)}. Then we process each edge server to confirm or update the data placement on that server. For each server vi, Di is sorted based on data popularity in descending order. We then process each data item dk ∈ Di. If placing this item does not exceed the maximal storage of vi, i.e., s(dk) + cc(vi) ≤ c(vi), its placement is confirmed. Otherwise, we simply call the procedure Find_Offloading_Server to find a nearby server to place it and update its placement f3(dk). The whole process is repeated for all data items on all servers. The detailed algorithm is presented in Algorithm 2. The major difference from the first method is that the processing order is based on server capacity (the outer "for" loop in Algorithm 2). The total time complexity of Algorithm 2 is O(N log N + Σ_{i=1}^{N} (|Di| log |Di| + |Di|X)). Here,


Figure 2.9: Illustration of two methods for Find Offloading Server.

O(N log N) is from ordering the server capacity, O(|Di| log |Di|) is from ordering the data popularity in Di, and O(|Di|X) is from |Di| rounds of finding an offloading server for each data item in Di. Note that Σ_{i=1}^{N} |Di| = W.

2.5.2 Offloading Choice


Now we discuss two possible methods to implement the procedure that finds the offloading server vl for data item di at vj (since vj does not have sufficient storage). The first method (Nearest Neighbor) simply finds the offloading server from vj's neighboring edge servers in topology G. The selection criteria are (1) vl ∈ {vl | vj vl ∈ E}; (2) s(di) + cc(vl) ≤ c(vl); and (3) ||vl, vj|| in the virtual plane is the minimum among all candidates satisfying (1) and (2). Note that if none of the neighbors has sufficient storage, we search the neighbors' neighbors instead. This repeats until we find a server to host di. Nearest Neighbor aims to find a server near vj in the network topology to hold the data.
The second method (Nearest Server) relaxes the requirement that vl be a neighbor of vj, and instead considers all possible servers in the region. It finds the offloading server based on (1) s(di) + cc(vl) ≤ c(vl) and (2) ||vl, di|| in the virtual plane is the minimum among all candidates satisfying (1). That is, it picks the server which has enough storage and a minimal distance to di. Nearest Server aims to find a server near the data item di in the virtual plane to hold it. Both methods try to offload di to a nearby server. The time X used to choose the offloading server is bounded by the number of neighbors of vj in G for Nearest Neighbor, or by N for Nearest Server. Fig. 2.9 illustrates the difference between these two methods.
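A minimal sketch of the two offloading choices follows, assuming server objects with a neighbors list and caller-supplied distance functions for the virtual plane; these structures are illustrative, not the dissertation's actual implementation.

def nearest_neighbor(d, vj, s, c, cc, dist):
    """Nearest Neighbor: expand ring by ring from the overloaded server vj in
    the topology G, returning the candidate with room for d that is closest
    to vj in the virtual plane (dist(u, v) is that distance)."""
    visited, frontier = {vj}, {vj}
    while frontier:
        ring = {v for u in frontier for v in u.neighbors} - visited
        fits = [v for v in ring if s[d] + cc[v] <= c[v]]
        if fits:
            return min(fits, key=lambda v: dist(v, vj))
        visited |= ring
        frontier = ring
    return None  # no server in the network can host d

def nearest_server(d, servers, s, c, cc, dist_to_data):
    """Nearest Server: among all servers with room, pick the one whose
    virtual coordinate is closest to the data item's coordinate."""
    fits = [v for v in servers if s[d] + cc[v] <= c[v]]
    return min(fits, key=lambda v: dist_to_data(v, d)) if fits else None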

2.5.3 Data Retrieve
Since the proposed methods might offload data items from the originally assigned server vj = f1(di) to another server, say vl = f2(di) or f3(di), we need a way to let the data retrieval method find the new server. Note that even though the network controller may know the location of the new server, an individual server receiving a data request may not know the global information of server capacities, and thus cannot reproduce the offloading decision. Instead of broadcasting all offloading decisions to the whole network, a simple solution is to let the original server vj host a forwarding entry that records the path towards the new server vl. By doing so, the data retrieval methods can stay the same and the data request for di is still forwarded towards vj. When it reaches vj, it is then further forwarded to vl. This costs additional path length and retrieval latency. However, since the offloading server is selected to minimize the distance between vj and vl, the additional cost is minimized.
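The forwarding-entry idea can be captured in a few lines; the sketch below is only illustrative (the class and method names are hypothetical), showing how a request for an offloaded item is redirected from the original server vj to the offloading server vl.

class EdgeServer:
    """Each server stores its local items plus forwarding entries for items
    that were offloaded away from it during placement."""
    def __init__(self):
        self.local_data = {}       # data id -> stored item
        self.forward_entry = {}    # data id -> next server toward v_l

    def handle_request(self, data_id):
        if data_id in self.local_data:
            return self.local_data[data_id]   # reply directly
        if data_id in self.forward_entry:
            # Follow the recorded path to the offloading server.
            return self.forward_entry[data_id].handle_request(data_id)
        raise KeyError(f"data {data_id} is neither stored nor forwarded here")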

2.6 Data Placement with Multiple Replicas


While our proposed offloading methods can balance the data storage loads based on the maximal server storage, they do not solve the problem of overloading data requests to the servers. To address this problem, in this section we propose a new data placement strategy that balances data request loads by leveraging replication. Data replication is the management of multiple replicas of the same data in the system. Replication strategies have been widely used in distributed systems [59, 4, 115, 62, 88, 12, 93] to ensure load balancing, reliability, and data transfer speed, as well as to offer the possibility to access the data efficiently from multiple locations.
In our data placement problem, if we allow multiple replicas of the same data distributed on different edge servers, we can not only provide load balancing but also improve the response time of data requests and the delay of data delivery. However, how to design the replication strategy becomes crucial. Generally, different data characteristics and system conditions influence the replication strategy. The key challenge of implementing an effective replication strategy consists of two questions. First, how to determine the number of replicas of each data item? Second, how to

choose the edge server to place these replicas? Next, we answer these two questions
separately for our data placement design.

2.6.1 Number of Replicas


Inspired by [12], to efficiently determine the number of replicas, we take the data size, the data popularity, and the remaining storage capacity of all edge servers into account. The general ideas are that (1) more replicas are given to larger and more popular data items, and (2) more available network storage also leads to more replicas. For each of these aspects (i.e., data size, data popularity, and available storage), we normalize it by its maximal value in the network. Therefore, the number of replicas for each data item di is calculated as follows:

    n(di) = (α1 · s(di)/smax + α2 · p(di)/pmax + α3 · Σj (c(vj) − cc(vj)) / Σj c(vj)) × β × N.    (2.4)

Recall that N is the number of edge servers, and smax and pmax are the maximal data size and data popularity. cc(vj) and c(vj) are the current used storage and maximal storage limit of vj; thus c(vj) − cc(vj) is the remaining storage capacity of vj. α1, α2, and α3 are weights added to these three coefficients, since the relative importance of the three aspects can vary based on the data characteristics and system conditions. While α1 + α2 + α3 = 1, they can be adjusted to meet different requirements. For instance, with increasing data popularity, we can use a higher α2 to increase the number of replicas for the more popular data. Similarly, if our edge network system has limited storage capacity or low bandwidth, we may decrease α3 or α1 to meet the requirement. β is a ratio parameter that controls the total number of replicas with respect to the number of servers N. A larger β leads to more total replicas.
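A direct translation of Eq. (2.4) is shown below; the default weights and β are arbitrary placeholders, and rounding the (possibly fractional) result up to at least one replica is our own assumption rather than part of the original formula.

def num_replicas(d, data, servers, alphas=(1/3, 1/3, 1/3), beta=0.1):
    """Number of replicas n(d) per Eq. (2.4).

    data[d] = (size, popularity); servers[v] = (capacity, used_storage);
    alphas = (a1, a2, a3) with a1 + a2 + a3 = 1; beta scales the budget.
    """
    s_max = max(size for size, _ in data.values())
    p_max = max(pop for _, pop in data.values())
    total_cap = sum(cap for cap, _ in servers.values())
    remaining = sum(cap - used for cap, used in servers.values())
    size, pop = data[d]
    a1, a2, a3 = alphas
    score = a1 * size / s_max + a2 * pop / p_max + a3 * remaining / total_cap
    return max(1, round(score * beta * len(servers)))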

2.6.2 Placing Replicas


After we determine the number of replicas n(di) of each data item di, we have to decide where to place these replicas. The placement strategy for replicas should
have the following goals. First, it should spread the replicas as evenly as possible
in the region to balance the load. Second, the placement should still be popularity-
aware, such that popular data has shorter shortest paths. Last, the mapping and
placement methods should be non-randomized so that retrieval can be easily done. To


Figure 2.10: Illustration of the virtual coordinates of data replicas generated by our
method.

achieve all of these, we modify our mapping method, which generates the coordinates of data items, to map a data item to n(di) locations in the virtual plane based on its popularity. Both the placement and retrieval methods remain the same, where each replica is just placed on the nearest server in the virtual plane, and data requests are forwarded to that server during retrieval.
Calculating Coordinates of Replicas. For each data item di, we generate n(di) replicas, denoted by d_i^1, d_i^2, · · · , d_i^{n(di)}, and spread them on the virtual plane. Inspired by the Voronoi diagram, we spread all replicas using the same radius, which depends on the data popularity, but different angles. As shown in Fig. 2.10(a), the Voronoi diagram formed by all replicas is evenly distributed in the virtual plane. The polar coordinates of the k-th replica d_i^k in the virtual plane are given by the following equations:

    r(d_i^k) = 1 − p(di)/pmax,
    θ(d_i^k) = 2π · h(di)/(2^{32} − 1) + 2πk/n(di).    (2.5)

Note that the first copy d_i^1 of data item di is mapped to the same location as di. The radius and angle are still deterministically decided by the data popularity and data index. The other replicas are evenly distributed with an angle difference of 2π/n(di) at the same radius. This solution seems to achieve all desired goals, but it may have a problem when the data item is very popular. In that case, the radius is small, so all replicas will be placed around the center of the network. Though their Voronoi cells are equal, this is not ideal since multiple replicas will be near each other. Therefore, we further modify the mapping method by defining a threshold τ < 0.5. If r(d_i^k) < τ, we shift r(d_i^k) by

adding 0.5, except for the first copy of the data which is still at the original location.
Fig. 2.10(b) shows such an example and the Voronoi cells of all replicas. Then, the
new mapping method is given as follows:

    r(d_i^k) = { 1.5 − p(di)/pmax,  if 1 − p(di)/pmax < τ and k > 1,
               { 1 − p(di)/pmax,   otherwise,
    θ(d_i^k) = 2π · h(di)/(2^{32} − 1) + 2πk/n(di).    (2.6)

In our simulations, we use τ = 0.01. After we have the coordinates of all replicas, we can place these replicas on the closest edge servers in the virtual space. The retrieval procedure remains straightforward as well. Our new placement method with replicas makes sure that (1) for more popular data, at least one copy of the data is closer to the center; (2) different data replicas are well spread in the virtual plane; and (3) the shortest paths are reduced since copies of data can be found at multiple locations.
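The mapping of Eqs. (2.5)-(2.6) can be sketched as follows; the 32-bit hash range and the convention that k = 0 denotes the first (unshifted) copy are our reading of the notation rather than part of the original text.

import math

def replica_coords(h, pop, p_max, n, tau=0.01):
    """Polar coordinates (r, theta) of the n replicas of one data item.

    h: item hash in [0, 2**32 - 1]; pop / p_max: its (maximal) popularity;
    the first copy (k = 0) keeps the base radius, while other replicas of
    very popular items (base radius < tau) are shifted outward by 0.5.
    """
    base_r = 1.0 - pop / p_max
    coords = []
    for k in range(n):
        r = base_r + 0.5 if (base_r < tau and k > 0) else base_r
        theta = 2 * math.pi * h / (2**32 - 1) + 2 * math.pi * k / n
        coords.append((r, theta % (2 * math.pi)))
    return coords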

2.7 Evaluation
In this section, we report the results from our simulations to evaluate our proposed
data placement strategies.

2.7.1 Simulation Setup


To test our proposed data placement strategies, we randomly construct a network topology G with 50 edge servers whose degrees satisfy a binomial distribution. The cost of each network link is set randomly from 1 to 20. Fig. 2.5(a) shows an example of such a network topology. Since edge servers are not normal servers with abundant capacity, we assume each edge server has a different limited maximal storage capacity ranging from 500MB to 1,000MB. In terms of the data set, we randomly generate 100 data items with data sizes from 100MB to 150MB. To simulate the data popularity of each item, we leverage a real-world news popularity dataset [71] provided by the University of Porto and randomly draw data popularity from it. Based on the simulated data popularity, we randomly generate data requests for different data items at random edge servers, either in the whole region or at the boundary of the network. Based on each data placement strategy, we place data items on the corresponding edge server(s) and then perform data retrievals of all data items randomly from all edge servers based on their data popularity. Three main performance metrics are used to evaluate the performance of the proposed methods:

• Average path length. It is the average length of the forwarding paths of data requests from the source edge server to the target server during the data retrieval process. The length of a path is the summation of the link costs of all links on the path, which reflects the end-to-end networking delay. Obviously, the shorter the average path length, the better the data retrieval performance.

• Average retrieval latency. It is defined as the average running time of the retrieval strategy during the data retrieval process. This mainly quantifies the computational complexity of the retrieval strategies; the latency here is not due to networking delay.

• Distribution of the number of data items. To measure the load balancing of data placement, we also report the distribution of the number of data items on each edge server. The goal is to minimize the largest number of data items on a single edge server.

We test seven different versions of our data placement strategies in the simulations:

• OUR-B. This is the basic data placement strategy (Basic_Placement). It does not consider the storage limit at each server, and the placement decision is purely based on virtual coordinates.

• OUR-S. This is a set of four data placement strategies where data items exceeding the maximal storage limit at the server assigned by OUR-B are offloaded to nearby servers. Since we have two methods for processing offloading in different orders (Algorithm 1 and Algorithm 2) and two methods for different offloading choices (Nearest Neighbor and Nearest Server), we have four different OUR-S methods in total:

  – OUR-S1: Algorithm 1 + Nearest Neighbor,
  – OUR-S2: Algorithm 1 + Nearest Server,
  – OUR-S3: Algorithm 2 + Nearest Neighbor,
  – OUR-S4: Algorithm 2 + Nearest Server.

• OUR-R. This is the data placement strategy that places multiple replicas in the network, where the number of replicas n(di) of a data item di depends on its popularity, size, and the available storage. For comparison, we also implement and test another version, OUR-R-fixed, where a fixed number of replicas is used.

For all of these data placement strategies, both global retrieve and local retrieve can be used. Each simulation runs 100 times, and we report the average result.

2.7.2 Comparison with Existing Methods


First, we compare the performance of our proposed data placement methods with
existing methods from [123] and [122]. Recall that both COIN [123] and GRED
[122] also use virtual-space-based methods to place data indexes or items but did
not consider the data popularity, and their work inspires our proposed methods.
Compared with COIN, GRED uses centroidal Voronoi tessellation to handle load balancing among edge servers. In this set of simulations, we compare our methods OUR-B and OUR-S1 to COIN and GRED under the same setting. The number of data requests is set to 1,500 and global retrieve is applied. Furthermore, in this subsection, we only compare methods without replicas, since methods with replicas always have better performance. The results are reported in Fig. 2.11.
The left plot shows the average path length when the data requests come from
random edge servers. When the data storage limit at each server is not considered,
our proposed basic method OUR-B performs similarly to COIN with a slightly shorter
average path length. But when we consider the storage limit, our method OUR-S1
can significantly reduce the path length compared with GRED. This confirms the
advantage of considering data popularity during both data placement and offloading.
Compared with OUR-B, OUR-S1 only increases the average path length a little (while GRED's path is much longer than COIN's). Recall that offloading data to other servers will increase the retrieval paths.
The right plot shows the results when the data requests come from the edge servers
near the boundary of the network. Since these requests need a longer path to reach


Figure 2.11: Comparison with existing data placement methods.

other parts of the network, the path lengths of all methods are longer than those in the left plot. In addition, we can also observe that our proposed algorithms (OUR-B and OUR-S1) are much better than the existing methods (COIN and GRED) in this case. This is mainly because our proposed methods consider data popularity and ensure that the more popular data is closer to the network center, from where the average path length to the boundary region is much shorter.
Obviously, without considering the data storage limit, the average accessing cost of OPT is better than that of all other methods. However, such an optimal method will increase the storage burden of the selected server, because all data items are placed on the single optimal server. On the contrary, our proposed methods spread all data to different servers based on their data popularity and virtual distances. Thus, our methods can balance the storage burden among edge servers while keeping relatively small accessing costs, as shown in Fig. 2.11. In addition, if the storage limit and/or request distribution are considered, finding the optimal server becomes a very challenging optimization problem.

2.7.3 Global Retrieve vs Local Retrieve


We now compare two retrieve strategies: global retrieve and local retrieve. First,
we use OUR-B as the placement strategy and vary the total number of requests from 100 to 2,000. Fig. 2.12(a) shows the average length of the path traveled by data requests.

(a) average path length (b) average retrieve latency
Figure 2.12: Comparison of global retrieve and local retrieve in OUR-B.

Note that the average path length for different numbers of requests is almost the same for both retrieve methods. This is reasonable since the data placement is static. More importantly, the global retrieve strategy has a shorter average path length than the local retrieve strategy. This is because the global strategy always takes the shortest path between the source and target servers. Fig. 2.12(b) presents the average retrieval latency of the two retrieve methods. As we can see, the average retrieval latency of the local strategy is far larger than that of the global strategy. This is because (1) the local retrieve strategy takes more hops during forwarding; and (2) it may also need to perform random forwarding to escape from local minima.
Second, we fix the number of requests at 1,000 and test the two retrieve strategies with both OUR-B and the four OUR-S placement methods. Fig. 2.13 presents the results. From the results, we can draw the following conclusions. (1) Similar to the results in Fig. 2.12, local retrieve takes a longer path and latency than global retrieve does in all cases. (2) For the average path length in global retrieve, it is clear that those of OUR-S are longer than that of OUR-B. Remember that in OUR-S, data items may be offloaded to another edge server rather than the nearest server to the data. Thus, global retrieve needs an additional path to reach the target server. (3) For local retrieve, the average path lengths of OUR-S are shorter than that of OUR-B. This might be because some of the data items are offloaded to servers that are closer to the requesting edge server. (4) It is obvious that the average retrieval latency of OUR-S is longer than that of OUR-B for both global and local retrieve, since the retrieval takes additional time to find where the data is stored. Interestingly, even though the average path length of local retrieve for OUR-S is shorter than that for OUR-B, local

(a) average path length (b) average retrieve latency
Figure 2.13: Comparison of global retrieve and local retrieve in OUR-B and OUR-S.

retrieve still needs more time to figure out the location of the data item, thus leading to a much longer retrieval latency than OUR-B.
In summary, the local strategy in general leads to a longer average path length and larger retrieval latency than the global strategy. However, the local strategy only utilizes the neighbors' information to compute the forwarding decision, which saves the storage of all shortest paths and makes it work well at scale. Therefore, there is a trade-off between the performance (retrieval latency) and the complexity (computing, storing, and updating the global shortest paths). For all simulation results in the remainder of this section, we only report the results from global retrieve due to space limitations.

2.7.4 Placement Strategies with Storage Limits


In this set of simulations, we aim to illustrate the power of offloading in data placement when we consider the maximal storage limits. Fig. 2.14 shows the distribution of the number of data items placed at each server for both OUR-B and OUR-S. The top subplot in Fig. 2.14(a) presents the detailed load of OUR-B at each server, which ignores the storage limit. Clearly, edge server 32 is overloaded with data items. The other four subplots are the results of the four OUR-S strategies; obviously, they can spread the overloaded data items to other edge servers to make the load more even. Fig. 2.14(b) also shows the aggregated distribution of the number of servers over the number of placed data items. The same conclusion can be drawn that OUR-S reduces the number of servers with high loads (such as those with more than 6 data items in OUR-B).

(a) loads among all servers (b) server load distribution
Figure 2.14: Distribution of placed data items among servers for OUR-B and four
OUR-S strategies.

2.7.5 Placement Strategies with Data Replicas


Finally, we evaluate our proposed placement strategy OUR-R with data replication. Here, we test the following four data placement methods:

• OUR-B, the basic placement strategy, where only a single copy of each data item is placed.

• OUR-R, the proposed placement strategy with multiple replicas, where the number of replicas is calculated based on data popularity, data size, and available storage.

• Random-R-fixed, a random placement strategy with a fixed number of replicas, where a fixed number of replicas of each item are randomly placed at edge servers.

• OUR-R-fixed, a variation of the proposed placement strategy with multiple replicas, where the number of replicas of each item is fixed.

To treat all methods with multiple replicas fairly, we let the fixed number of replicas in Random-R-fixed/OUR-R-fixed be equal to the average number of replicas in OUR-R.
First, we display the loads among all servers with multiple data replicas, as shown in Fig. 2.15. As we can see, there are more data items on each edge server compared with the single-replica strategy OUR-B. In addition, the difference in loads between the three multiple-replica strategies is not obvious due to data duplication. However, the average path length differs, as shown in the next figure.

Figure 2.15: Loads among servers with multiple data replicas.

(a) global retrieve (b) local retrieve
Figure 2.16: Comparison of placement strategies with multiple replicas.

Fig. 2.16 shows the results of all methods under either global retrieve or local retrieve. First, the average path length of local retrieve is longer than that of global retrieve, which is the same conclusion as in previous simulations. Second, all methods with multiple replicas perform much better than the single-replica method does. Compared to the single-replica strategy, OUR-R reduces the average path length by up to 36%. This confirms that data replication can significantly reduce the average path length of data requests. Third, among the methods with multiple replicas, OUR-R performs better than OUR-R-fixed. This shows the advantage of carefully estimating the number of replicas based on data popularity over using a fixed number of replicas (evenly distributed among data items). Fourth, in global retrieve, OUR-R-fixed performs better than Random-R-fixed, which shows the proposed placement with the Voronoi diagram is much better than random placement. However, OUR-R-fixed performs worse than Random-R-fixed in local retrieve, since the Random-R-fixed method may randomly place multiple replicas near the requesting server. In summary, among all methods with multiple replicas, our proposed OUR-R achieves the best results.

2.8 Literature Review


Data placement has been an important topic in distributed databases/systems [17, 1, 13], peer-to-peer networking [59, 103, 100, 88], content delivery networks [19, 4], and cloud computing [49, 125, 26, 109, 141, 107, 133]. Due to the similarity between cloud computing and edge computing, here we mainly focus on reviewing recent data placement strategies in cloud computing. Li et al. [49] proposed a clustering algorithm based on the principle of minimum distance to dynamically place data into clusters, and a consistent hashing method to decide the specific storage servers to hold the data in cloud computing. However, the proposed methods only consider the similarity among data as the distance metric in clustering but ignore all networking and computing delays among servers. Xu et al. [125] investigated the data placement problem among data centers in cloud computing, and proposed a genetic algorithm to obtain the best approximation of data placement. Similarly, Guo et al. [26] also used a genetic algorithm to solve the data placement problem among data centers, where both the cooperation costs among data slices and the global load balancing are considered. Wang et al. [109] also studied data placement strategies for data-intensive computations in distributed cloud systems, aiming to minimize the total data scheduling cost between data centers while maintaining statistical I/O load balancing and capacity load balancing. In addition, several works [133, 107, 141] have focused on data placement strategies for scientific workflows in cloud environments, where data dependency and temporal relations among data and tasks play important roles. However, these centralized data placement methods suffer from high computation and communication overheads.
While similar to cloud computing, edge computing has its own characteristics [95]. Thus, the data placement problem in edge computing has become a newly emerging topic in recent years. Both Shao et al. [93] and Lin et al. [55] have studied data placement strategies for workflows in edge computing. Shao et al. [93] proposed a data replica placement strategy for processing data-intensive IoT workflows, which aims to minimize the data access costs while meeting the workflow's deadline constraint. The problem is modeled as a 0-1 integer programming problem that considers data dependency, data reliability, and user cooperation, and is then solved by an intelligent swarm optimization. Similarly, Lin et al. [55] also proposed a self-adaptive discrete particle swarm optimization algorithm to optimize the data transmission time when placing data for a scientific workflow. Li et al. [45] investigated a joint optimization of data placement and task scheduling in edge computing to reduce the computation delay and response time. For the data placement optimization, the authors considered the value, transmission cost, and replacement cost of data blocks, and the formulated optimization problem is solved by a tabu search algorithm designed for the knapsack problem. However, again, these optimization-based methods usually suffer from poor stability and high overheads. Breitbach et al. [12] have also studied both data placement and task placement in edge computing by considering multiple context dimensions. For the data placement part, the proposed data management scheme adopts context-aware replication, where the parameters of the replication strategy are tuned based on context information (such as data size, remaining storage, stability, and application).
Most recently, Huang et al. [30] have studied caching fairness for data sharing in edge computing environments. They propose fairness metrics that take resources and wireless contention into consideration and formulate the caching fairness problem as an integer linear programming problem. They then propose an approximation algorithm based on the connected facility location algorithm, as well as a distributed algorithm. Xie et al. [123] studied the data-sharing problem in edge computing and proposed a coordinate-based data indexing mechanism to enable efficient data sharing in edge computing. It maps both switches and data indexes into a virtual space with associated coordinates, and the index servers are then selected for each data item based on the virtual coordinates. Their simulations showed that both the routing path lengths and forwarding table sizes for publishing/querying the data indexes are efficient. Xie et al. [122] further extended their virtual-space method to handle data placement and retrieval in edge computing with an enhancement based on centroidal Voronoi tessellation to handle load balancing among edge servers. Both [123] and [122] inspire our work on data placement (adopting a virtual-space-based placement method with greedy routing-based retrieval), but they do not consider the data popularity of data items.

Note that there are other types of resource management problems in edge computing, such as virtual network function placement [90, 130], service placement [79, 22], and cloudlet placement [140, 139, 129]. These problems are different from the data placement problem, and their solutions cannot directly solve the data placement problem considered here.

2.9 Chapter Summary


Data placement has become a critical issue in edge computing, where data is generated from edge units and stored within an edge network. In this chapter, we have studied a data placement strategy for edge networks that takes data popularity into consideration. Based on their data popularity, data items are mapped to a virtual plane where the network topology of edge servers is embedded. The mapping method puts more popular data closer to the network center in order to shorten its retrieval path from all other regions. Corresponding placement and retrieval strategies can then be easily performed based on distance measurements between the virtual coordinates of data items and edge servers. We also design several offloading and replication methods to overcome the storage limits and further improve the performance. Simulation results confirm the effectiveness and efficiency of our proposed strategies.

CHAPTER 3

JOINT RESOURCE PLACEMENT AND TASK DISPATCHING

3.1 Introduction
With the proliferation of Internet of Things (IoT) data and innovative mobile
services, there is a growing demand for low-latency access to resources such as data
and computing services. Mobile edge computing has evolved into an effective com-
puting paradigm for meeting the need for low-latency access by locating resources
and dispatching tasks at edge clouds close to mobile users.
As shown in Fig. 3.1, a typical edge computing environment consists of mobile users, edge clouds (including multiple edge servers connected by the edge network), and a remote cloud (usually within data centers). Each edge server is generally deployed at the network edge near mobile users and owns specific storage, CPU, and memory capacities. Mobile users can generate computation tasks at any location, which request to be dispatched to edge servers with sufficient resources (i.e., internal computation resources such as CPU, memory, and storage) and may also require certain data or services (i.e., external resources such as training data or machine learning services). Note that the types of computing tasks from mobile users/devices are heterogeneous due to diverse settings and applications. For example, some tasks may only request data (e.g., images, videos) or a machine learning (ML) model from the edge network, and then process it locally or perform the ML computation based on the model at the local edge server. Some tasks may request computation at other edge servers with certain computation services, such as video analysis, speech recognition, and 3D rendering. Some tasks may need a combination of data, services, and computation resources, such as distributed federated learning or interactive augmented reality. Fig. 3.1 shows some examples where tasks from mobile users request either data/services or both. Note that multiple user tasks can be served by the same edge server, and the deployment of multiple copies of resources can usually reduce the accessing cost or balance loads among servers. The diverse types of tasks from mobile users and

Figure 3.1: A typical edge cloud environment.

dynamically available resources at edge servers introduce new challenges in resource management and task dispatching in such a complex edge computing system.
In a real, dynamic edge computing environment, tasks from mobile users generally have a small size and can be easily moved around and distributed to different edge servers for processing. However, the resources, such as data and services, may not be adjusted fast enough to meet the dynamic requirements of tasks. For example, it takes time to reconfigure a service on a new edge server. Similarly, migrating a large amount of data also involves additional costs. Therefore, it is natural to manage resources and tasks at two different time scales, i.e., task dispatching can be performed on a fast timescale, while resource placement can occur on a slow timescale. Such multi-timescale solutions have been shown to be more efficient than single-timescale methods in edge computing [22, 132]. In addition, a critical factor that has often been overlooked is the dynamic status of edge servers. Edge servers are not always running, due to regular maintenance or certain events (e.g., power outages and system errors). If the status of an edge server changes, the overall topology of the edge network changes, and this further affects the performance of the entire edge system. Therefore, it is important to take server status into account in resource placement and task dispatching. In this chapter, we jointly study the resource placement and task dispatching problems in mobile edge computing with the aim of maximizing the total utility of performed tasks.

3.2 System Models and The Optimization
In this section, we first introduce our network and system models under a general
edge computing architecture. Then we formulate the resource placement problem,
the task dispatching problem, and the joint optimization problem, respectively.

3.2.1 Network and System Models


Without loss of generality, we construct a typical mobile edge computing architecture as shown in Fig. 3.1. The edge network topology is defined as a graph G(V, E), consisting of N edge servers and M direct links among them. Here, V = {v1, · · · , vN} and E = {e1, · · · , eM} are the set of edge servers and the set of links, respectively. Each server vi ∈ V has a maximal storage capacity ci, a CPU frequency fi, a memory capacity mi, and a current remaining storage capacity cci. Each link el ∈ E has a propagation delay pl and a network bandwidth bl.
Assume that there are X data items (D = {d1, · · · , dX}), Y services (S = {s1, · · · , sY}), and Z computing tasks (U = {u1, · · · , uZ}). Since both data items and services can be considered as needed resources for computing tasks, we treat them as O = X + Y resources in total, i.e., Q = {q1 = d1, . . . , qX = dX, qX+1 = s1, . . . , qO = sY}. Each resource qj has a storage size oj, a download cost ϖj from the cloud, a CPU requirement ζj, and a memory requirement ηj. Note that for data resources, the CPU and memory requirements are set to 0. Each task uk has a requested resource set Ωk, a CPU requirement γk, a memory requirement δk, a size of expected output data βk, an arriving server Ψk, and a benefit ρk.
To define the requested resources of task uk, we introduce a binary variable ωk,j as the indicator of whether resource qj is required by task uk:

    ωk,j = { 1, if resource qj is required by task uk,
           { 0, otherwise.

Then, the requested resource set is Ωk = {qj | ωk,j = 1}, and its input resource size αk can be calculated as αk = Σ_{j=1}^{O} ωk,j · oj. Note that the resources requested by task uk could be either data items or specific services.


We assume that tasks arrive at discrete time units t, where the duration of each time unit is τ. We will discuss the case where multiple timescales are used later. For each server vi, we also assume there is a status indicator st_i^t to represent whether this server is available at time t (available when st_i^t = 1, not available when st_i^t = 0). There are two possible causes of unavailability: predictable ones (such as scheduled updates or maintenance) or sudden events (such as power outages). Here we mainly consider the first type. For the latter, different backup strategies should be considered.

3.2.2 Resource Placement


We first consider a resource placement problem where a placement decision is needed for each resource qj at time t. A binary variable x_{j,i}^t is defined as the placement decision at time t, indicating whether resource qj is placed on edge server vi:

    x_{j,i}^t = { 1, if qj will be placed on vi at t,
                { 0, otherwise.    (3.1)

Here, we assume that data items and services can have replicas in the edge cloud (i.e., Σ_{i=1}^{N} x_{j,i}^t can be larger than 1). In addition, an edge server may store multiple data items and services, but the total size placed on edge server vi cannot exceed its current remaining storage capacity:

    Σ_{j=1}^{O} x_{j,i}^t · oj ≤ st_i^t · cci, for all vi.    (3.2)

For services, there are also specific CPU and memory requirements on the placed server:

    x_{j,i}^t · ζj ≤ st_i^t · fi, for all vi, qj,    (3.3)
    x_{j,i}^t · ηj ≤ st_i^t · mi, for all vi, qj.    (3.4)

The resource placement aims to maximize the total benefit minus the total cost of all serving tasks while satisfying the resource constraints. Here, we consider two types of costs for serving tasks: the placement cost and the accessing cost.
For the placement cost of a resource item qj to a server vi during the placement, we consider two possible ways: (a) directly downloading from the cloud with a cost of ϖj, or (b) transferring from a nearby server vk which holds a copy of qj at t − 1, with a cost of f(qj, vk, vi). Here, assuming that Pj is the shortest path in G^t connecting vk and vi¹, the cost f(qj, vk, vi) can be defined as follows:

    f(qj, vk, vi) = { 0, if vi = vk,
                    { Σ_{el ∈ Pj} (oj/bl + pl), otherwise.    (3.5)

Thus, the placement cost of qj to vi at t is the minimum among all these options, i.e.,

    pc_{j,i}^t = { 0, if x_{j,i}^{t−1} = 1,
                 { min(ϖj, min_{k≠i}(x_{j,k}^{t−1} · f(qj, vk, vi))), otherwise.    (3.6)

Note that if qj is already on vi at t − 1, no cost is needed. Then, the placement cost for resource qj at t can be defined as

    ν_j^t = Σ_{i=1}^{N} x_{j,i}^t · pc_{j,i}^t.    (3.7)

We also define a variable to indicate whether resource qj is requested by any task:

    ωj = { 1, if Σ_k ωk,j ≥ 1,
         { 0, otherwise.    (3.8)

For the accessing cost of resources after the data/service is placed, let σ_{j,k}^t be the accessing cost of resource qj required by task uk. Note that the accessing cost depends on which edge server task uk is processed at. Let Υk = Υ(uk) be the server assigned by the task dispatching of uk. The accessing cost of qj can be defined as

    σ_{j,k}^t = min_{vi ≠ Υk} x_{j,i}^t · f(qj, Υk, vi).    (3.9)

Without task dispatching, we assume that task uk is processed at its arriving server Ψk; then the accessing cost is

    σ_{j,k}^t = min_{vi ≠ Ψk} x_{j,i}^t · f(qj, Ψk, vi).    (3.10)

In general, we define the accessing cost of qj from any edge server vi as

    σ̄_{j,i}^t = min_{l ≠ i} x_{j,l}^t · f(qj, vl, vi).    (3.11)

¹ Here G^t represents the edge network formed by all available servers at time t. The shortest path is defined with regard to the summation of the propagation and transmission delays of qj over the path.

Since each serving task has a benefit ρk, the utility of each task uk can be defined as Σ_j (ρk − ωk,j · σ_{j,k}^t).
Now we can formulate the resource placement problem as an optimization problem. The objective is to maximize the total utilities of all serving tasks minus the summation of the placement costs of all requested resources at time t:

    max Σ_k Σ_j (ρk − ωk,j · σ_{j,k}^t) − Σ_j ωj · ν_j^t
    s.t. Σ_j x_{j,i}^t · oj ≤ st_i^t · cci, ∀i
         x_{j,i}^t · ζj ≤ st_i^t · fi, ∀i, j
         x_{j,i}^t · ηj ≤ st_i^t · mi, ∀i, j    (3.12)
         x_{j,i}^t ∈ {0, 1}, ∀i, j
         i ∈ (1, 2, . . . , N), j ∈ (1, 2, . . . , O).

3.2.3 Task Dispatching


In terms of task dispatching, we assume all tasks arrive in the edge network in an arbitrary order. At time t, the goal of task dispatching is to find an optimal edge server vi to process each task uk in order to minimize the total completion cost of the task. Specifically, the total completion cost of a task uk mainly consists of three parts: (a) the accessing cost of resources required by uk, (b) the computation cost of uk, and (c) the transmission cost of the output data of uk.
We denote y_{k,i}^t as the task dispatching decision at t, indicating whether task uk is dispatched to edge server vi:

    y_{k,i}^t = { 1, if task uk is dispatched to server vi at t,
                { 0, otherwise.    (3.13)

Here we assume that each task is dispatched to at most a single server, i.e., Σ_{i=1}^{N} y_{k,i}^t ≤ 1.
Note that there are different types of tasks: some only need data from the edge
network, some only need to perform general computation at any server either with
data or not, and some need to perform specific computation with certain services
at the available server. Our formulation can model all these task types. If task uk

only needs data, then γk = 0 and δk = 0 while αk > 0. If uk only needs general computation without a specific service or data, then γk > 0 and δk > 0 while αk = 0.
Assume that task uk is dispatched to edge server vi, i.e., y_{k,i}^t = 1; then its associated costs are defined as follows.
Accessing cost of resources: The transmission cost of the input data and needed services for task uk is defined as C_{k,i}^{input} = Σ_{j=1}^{O} ωk,j · σ̄_{j,i}^t.
Computation cost: Let ξk(z) be the function that defines the CPU cycles needed to process task uk with input data/service size z. The computation cost of task uk processed on edge server vi is then defined as C_{k,i}^{comp} = Σ_{j=1}^{O} ωk,j · ξk(oj)/fi.
Transmission cost of output: The total transmission cost of the output data of task uk from edge server vi to the arriving edge server Ψk is C_{k,i}^{output} = f(βk, vi, Ψk).
Therefore, the completion cost of task uk is calculated as

    ς_{k,i}^t = C_{k,i}^{input} + C_{k,i}^{comp} + C_{k,i}^{output}.    (3.14)
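For illustration, the completion cost of Eq. (3.14) can be computed as below; the task/server attribute names and the callbacks f (path transfer cost) and xi (the CPU-cycle function ξk) are hypothetical stand-ins for the notation above, not an actual implementation from this work.

def completion_cost(task, server, placement, f, xi):
    """Completion cost of Eq. (3.14) when `task` is dispatched to `server`.

    placement[q]: servers currently holding a replica of resource q;
    f(size, u, v): transfer cost over the shortest path from u to v;
    xi(task, size): CPU cycles to process the task on that input size.
    """
    access = sum(min(f(q.size, v, server) for v in placement[q])
                 for q in task.resources)                        # C^input
    compute = sum(xi(task, q.size) / server.cpu_freq
                  for q in task.resources)                       # C^comp
    output = f(task.output_size, server, task.arrival_server)    # C^output
    return access + compute + output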

Recall that each task has a benefit ρk. We can then formulate the task dispatching decision as an optimization problem whose goal is to maximize the total task utility when tasks run on their assigned servers at t:

    max Σ_k Σ_i y_{k,i}^t (ρk − ς_{k,i}^t)
    s.t. Σ_k y_{k,i}^t · ς_{k,i}^t ≤ τ, ∀i
         y_{k,i}^t · αk ≤ st_i^t · cci, ∀i, k
         y_{k,i}^t · γk ≤ st_i^t · fi, ∀i, k
         y_{k,i}^t · δk ≤ st_i^t · mi, ∀i, k    (3.15)
         Σ_i y_{k,i}^t ≤ 1, ∀k
         y_{k,i}^t ∈ {0, 1}, ∀k, ∀i
         i ∈ (1, 2, . . . , N), k ∈ (1, 2, . . . , Z).

Note that the constraint Σ_k y_{k,i}^t · ς_{k,i}^t ≤ τ makes sure that the dispatched tasks can be completed within the duration τ of one time scale.

3.2.4 Joint Optimization Problem
We now consider the joint resource placement and task dispatching problem as a nonlinear program:

    max Σ_k Σ_i y_{k,i}^t (ρk − ς_{k,i}^t) − Σ_j ωj · ν_j^t    (3.16)
    s.t. Σ_j x_{j,i}^t · oj + y_{k,i}^t · αk ≤ st_i^t · cci, ∀i, k    (3.17)
         x_{j,i}^t · ζj ≤ st_i^t · fi, ∀i, j    (3.18)
         x_{j,i}^t · ηj ≤ st_i^t · mi, ∀i, j    (3.19)
         y_{k,i}^t · γk ≤ st_i^t · fi, ∀i, k    (3.20)
         y_{k,i}^t · δk ≤ st_i^t · mi, ∀i, k    (3.21)
         Σ_i y_{k,i}^t ≤ 1, ∀k    (3.22)
         Σ_k y_{k,i}^t · ς_{k,i}^t ≤ τ, ∀i    (3.23)
         x_{j,i}^t ∈ {0, 1}, y_{k,i}^t ∈ {0, 1}    (3.24)
         i ∈ (1, . . . , N), j ∈ (1, . . . , O),    (3.25)
         k ∈ (1, . . . , Z).    (3.26)
Since there is a nonlinear term inside y_{k,i}^t · ς_{k,i}^t (the completion cost ς_{k,i}^t itself depends on the placement decision x_{j,i}^t through the accessing cost), the overall problem is a nonlinear integer program, which is known to be difficult to solve due to its high computational complexity.

3.3 Two-Stage Optimization Method


To solve this challenging joint optimization problem, we propose a two-stage algorithm to decompose the problem and solve it via multiple iterations. One advantage of the proposed two-stage method is that it can be easily adapted to perform joint optimization across different timescales.

3.3.1 Two-Stage Optimization


The main idea of this algorithm is as follows. First, we randomly generate a feasible task dispatching decision y_{k,i}^{t,0}, then formulate and solve the resource placement problem (obtaining x_{j,i}^{t,1}) to maximize the total task utilities. Next, we take the resource placement decision x_{j,i}^{t,1} as input, and formulate and solve the task dispatching problem (obtaining y_{k,i}^{t,1}). This finishes the first round of the two-stage optimization. We then repeat the two steps, i.e., iteratively taking the latest resource placement or task dispatching decision as input to optimize the other decision within the overall joint problem, until a specific stopping condition is satisfied.
Two-Stage Decomposition: The details of the decomposition in the ι-th round are as follows.
Stage 1: Solving the resource placement problem with fixed task dispatching. In this stage, our goal is to determine the resource placement of each data item and service in order to maximize the total task utilities under the last task dispatching decision y_{k,i}^{t,ι−1}. The problem can be formulated as P1:

    max Σ_k Σ_i y_{k,i}^{t,ι−1} (ρk − ς_{k,i}^t) − Σ_j ϱ_j^t    (3.27)
    s.t. (3.17), (3.18), (3.19), (3.23), (3.24), (3.25), (3.26)

The solution to this problem is x_{j,i}^{t,ι}.
Stage 2: Solving the task dispatching problem with fixed resource placement. In this stage, we take the resource placement decision x_{j,i}^{t,ι} generated in the first stage as input and determine the task dispatching decision y_{k,i}^{t,ι} for each task to maximize the total utility. The problem can be formulated as P2:

    max Σ_k Σ_i y_{k,i}^{t,ι} (ρk − ς_{k,i}^t) − Σ_j ϱ_j^t    (3.28)
    s.t. (3.17), (3.20) − (3.26)

The solution of this stage is y_{k,i}^{t,ι}.
After the decomposition, in each round both P1 and P2 are linear integer programming problems, and thus can be solved by classical integer programming methods (e.g., dynamic programming, branch and bound).
Overall Iteration, Initialization, and Termination: Algorithm 3 shows the overall algorithm. Initially, a feasible random task dispatching decision y_{k,i}^{t,0} is generated (Line 2). Then, in each round (Lines 5-12), we solve P1 and P2 with the previous decision as the input. The resource placement and task dispatching decisions (x_{j,i}^{t,ι} and y_{k,i}^{t,ι}) are optimized iteratively. Finally, the iteration terminates (Line 13) when
Algorithm 3 Two-Stage Optimization Method
Input: Status of all servers V and the network G, resources Q and tasks U for time t.
Output: Resource placement and task dispatching decisions x_{j,i}^t and y_{k,i}^t.
1: Initialize max_itr, max_occur, bound_val
2: Generate a random initial task dispatching decision y_{k,i}^{t,0} which is feasible (i.e., satisfying the constraints in P2)
3: ι = 1 and count_num = 0;
4: repeat
5:   Stage 1: Calculate x_{j,i}^{t,ι} by solving P1 with y_{k,i}^{t,ι−1} as the fixed task dispatching
6:   Stage 2: Calculate y_{k,i}^{t,ι} by solving P2 with x_{j,i}^{t,ι} as the fixed resource placement, and let obj_val be the achieved objective value (total utility from tasks)
7:   if obj_val > bound_val then
8:     bound_val = obj_val; count_num = 1
9:     x_{j,i}^t = x_{j,i}^{t,ι}; y_{k,i}^t = y_{k,i}^{t,ι}
10:  else if obj_val = bound_val then
11:    count_num = count_num + 1
12:  ι = ι + 1
13: until count_num = max_occur or ι = max_itr
14: return x_{j,i}^t and y_{k,i}^t

either of the following conditions is met: (1) the number of iterations reaches a certain threshold max_itr, or (2) the current objective value (total task utility) has occurred more times than a specified threshold max_occur. These two thresholds can be set via experiments. Obviously, larger threshold values lead to longer iterations but improved results.
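The control flow of Algorithm 3 can be summarized in a few lines of Python; solve_p1 and solve_p2 are placeholders for solvers of P1 and P2 (e.g., built with any off-the-shelf integer programming library), so this is a skeleton under those assumptions rather than a complete implementation.

def two_stage_optimize(solve_p1, solve_p2, initial_dispatch,
                       max_itr=50, max_occur=5):
    """Iterative two-stage loop of Algorithm 3.

    solve_p1(y) -> x: placement with dispatching fixed (Stage 1);
    solve_p2(x) -> (y, obj_val): dispatching with placement fixed (Stage 2).
    """
    y = initial_dispatch                       # feasible random y^{t,0}
    best_val, best_x, best_y = float("-inf"), None, None
    count_num = 0
    for _ in range(max_itr):
        x = solve_p1(y)                        # Stage 1
        y, obj_val = solve_p2(x)               # Stage 2
        if obj_val > best_val:
            best_val, count_num = obj_val, 1
            best_x, best_y = x, y
        elif obj_val == best_val:
            count_num += 1
        if count_num >= max_occur:             # objective value has stabilized
            break
    return best_x, best_y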

3.3.2 Joint Optimization across Two Timescales


So far, we have only discussed our two-stage algorithm within a single time slice. In edge computing systems, the workload (i.e., computing tasks) and the resources (e.g., data or services) serving such workload need to be managed on different timescales [22, 132]. Usually, the computing tasks can be distributed more frequently on a fast timescale


Figure 3.2: Illustration of joint resource placement and task dispatching across two
timescales.

in the edge network, while the resource placement could be adjusted (such as rede-
ploying or migrating services) less frequently on a slow timescale. Compared with the
single timescale method, multi-timescale solutions [22, 132] can achieve better per-
formance with more flexible management, thus gaining significant attraction recently
from the research community.
Our proposed two-stage algorithm can be easily adapted to a two-timescale solution. As illustrated in Fig. 3.2, we can make task dispatching decisions on the fast timescale (at the starting point of each time slot) and make resource placement decisions on the slow timescale (at the starting point of each time frame). Here, we assume that each time frame includes χ time slots. More specifically, at the beginning of each time frame, we run our proposed iterative two-stage algorithm (Algorithm 3), and at the beginning of each time slot (except for the first time slot), we only solve the Stage 2 problem (P2), where the resource placement is fixed. By doing so, not only can we handle diverse dynamics among workloads and resources, but the running time of the overall algorithm is also reduced, since the iterative algorithm is only performed once per time frame and solving P2 at each time slot is relatively simple. Thus, it leads to greater flexibility with more cost savings.
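A minimal sketch of this two-timescale scheduling loop is given below, with two_stage, solve_p2_only, and apply_decisions as hypothetical hooks for Algorithm 3, the P2-only solver, and the actual deployment of decisions, respectively.

def run_two_timescales(num_frames, chi, two_stage, solve_p2_only, apply_decisions):
    """Slow timescale: full two-stage optimization at each frame start;
    fast timescale: re-solve only P2 in each of the remaining chi - 1 slots."""
    for _ in range(num_frames):
        x, y = two_stage()               # frame start: placement + dispatching
        apply_decisions(x, y)
        for _ in range(chi - 1):
            y = solve_p2_only(x)         # placement x stays fixed within the frame
            apply_decisions(x, y)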

3.4 Reinforcement Learning based Method
In this section, we consider an alternative method for solving the joint optimization by leveraging the emerging deep reinforcement learning technique. Reinforcement learning (RL) has a great capability to attack complex optimization problems in dynamic systems. The characteristic of the RL framework is that decisions are made by an RL agent, and the feedback generated by the environment is used to improve the agent's decisions. There are three key elements in an RL framework: state, action, and reward.
Generally, RL algorithms can be classified into value-based and policy-based categories. Value-based RL methods (e.g., Q-learning, Deep Q-network (DQN) [69], Double DQN [104]) can select and evaluate the optimal value function with lower variance. The value function measures the goodness of a state (state-value) or how good it is to perform an action from a given state (action-value). However, it is difficult for value-based methods to handle continuous action spaces: evaluating the value over an infinite number of actions is time-consuming. On the other hand, policy-based methods, such as policy gradient [53], are effective in high-dimensional or continuous action spaces. They can learn stochastic policies and have better convergence properties. The main idea is to determine, at a given state, which action to take in order to maximize the reward. This is achieved by finding and tuning a vector of parameters (θ) so as to select the best action to take for policy π, where π is the probability of taking action a at state s under parameters θ. There are some disadvantages to policy-based methods: (1) they typically converge to a local rather than global optimum; (2) evaluating a policy is typically inefficient and has high variance.
The Actor-Critic RL method [68] combines the basic ideas of value-based and policy-based algorithms. The actor uses a policy-based method to select the action, while the critic uses a value-based method. As shown in Fig. 3.3, the actor takes the state as input and outputs the best action; it essentially controls how the agent behaves by learning the optimal policy (policy-based). The critic, on the other hand, evaluates the action by computing the value function (value-based), and the feedback (such as an error signal) tells the actor how good its action was and how it should adjust. However, since the actor-critic method involves two neural networks,

Figure 3.3: The architecture of the Actor-Critic RL framework.

whose parameters are updated continuously and are correlated before and after each update, the networks may view the problem one-sidedly or even fail to learn anything. To avoid such a problem in our setting, we leverage the Deep Deterministic Policy Gradient (DDPG) RL technique [97, 53] to solve the joint optimization problem.

3.4.1 RL Framework: State, Action, and Reward


We first define the specific state vector, action vector, and reward for our system model to enable the proposed RL framework.
State Vector: At each step ι, the agent collects the edge network information
and parameters defined below to form the system state.

• M : the number of links among edge servers.

• N : the number of edge servers.

• bl : available network bandwidth of each link.

• cri : available computing resources (e.g., storage, CPU, memory) of each edge
server.

Let SS be the state space, the system state ssι ∈ SS at step ι can be defined as

ssι = {b1 , b2 , · · · , bM , cr1 , cr2 , · · · , crN }ι .

Action Vector: In terms of the action vector, the agent will make decisions for
both resource placement and task dispatching. The decision mainly consists of where

48
to place resources and where to dispatch tasks. Therefore, the action vector includes
two parts.

• RP_j = {rp_{j,1}, rp_{j,2}, · · · , rp_{j,N}}: resource placement of each external resource q_j (data, service).

• TD_k = {td_{k,1}, td_{k,2}, · · · , td_{k,N}}: dispatching target of each task u_k (released by a mobile user).

Let AA be the action space, the system action aaι ∈ AA at step ι can be defined as

aaι = {RP1 , RP2 , · · · , RPR , T D1 , T D2 , · · · , T DZ }ι .

Reward: For each step, the agent will get the reward rrι from the environment
after taking a possible action aaι . Generally, the reward function is related to the
objective function in the optimization problem. Fortunately, the objective of our
optimization problem is to maximize the total utility of all tasks, so the reward of the RL agent is set as follows.
rr_ι = Σ_k Σ_i (ρ_k − ς^t_{k,i}) − Σ_j ϱ^t_j .    (3.29)

Notice that the reward rrι can be obtained given the agent’s action aaι , which includes
the solution of both resource placement and task dispatching, and the environment.
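As an illustration, the sketch below assembles the state vector, the action vector, and the reward of Eq. (3.29) with numpy; the function names and the toy numbers are our own illustrative assumptions, not part of the system implementation.

import numpy as np

def build_state(link_bw, server_res):
    # ss_ι = {b_1..b_M, cr_1..cr_N}: link bandwidths followed by server resources
    return np.concatenate([link_bw, server_res.ravel()])

def build_action(rp, td):
    # aa_ι = {RP_1..RP_R, TD_1..TD_Z}: placement vectors then dispatch vectors
    return np.concatenate([rp.ravel(), td.ravel()])

def reward(task_benefit, task_cost, placement_cost):
    # rr_ι = Σ_k Σ_i (ρ_k − ς^t_{k,i}) − Σ_j ϱ^t_j, cf. Eq. (3.29)
    return task_benefit.sum() - task_cost.sum() - placement_cost.sum()

ss = build_state(np.array([10.0, 8.0]), np.array([[4.0, 2.0], [3.0, 1.0]]))
print(ss.shape, reward(np.array([5.0, 6.0]), np.array([1.0, 2.0]), np.array([0.5])))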

3.4.2 DDPG RL Algorithm


The main goal of the RL algorithm is to tune the learning model's parameters (θ) so as to select the best action aa based on the given state. We adopt the Deep Deterministic Policy Gradient (DDPG) technique [97, 53] to perform the RL. DDPG integrates the essential ideas of actor-critic and DQN. DQN uses a replay memory and two sets of neural networks with the same structure but different parameter update frequencies, which effectively promotes learning. DDPG follows a similar idea but with somewhat more complicated networks. As aforementioned, compared with other RL methods, the policy gradient can select actions in continuous action spaces, where the selection is usually performed randomly according to the learned action distribution. In DDPG, however, the action selection is deterministic rather than random.

Figure 3.4: The architecture of the DDPG RL Algorithm. The circled numbers are the corresponding steps.

The architecture of the neural networks in DDPG is similar to that of Actor-Critic: both need a policy-based network and a value-based network, as shown in Fig. 3.4. Each kind of network further includes two networks: an evaluation network and a target network. The target networks are time-delayed copies of their original networks that slowly track the learned networks; using these target networks greatly improves stability in learning.
The main steps of the DDPG algorithm are as follows.

1. Initialize the system and environment based on the edge network G, the set of external resources Q, and the set of tasks U, as well as other network information.

2. Initialize the Actor evaluation network µ(ss|θ^µ) and target network µ′(ss|θ^{µ′}), as well as the Critic evaluation network Q(ss, aa|θ^Q) and target network Q′(ss, aa|θ^{Q′}), where θ^µ and θ^Q are the evaluation network parameters, and θ^{µ′} and θ^{Q′} are the target network parameters.

3. Initialize replay buffer D, the maximum number of episodes max ep and the
maximum number of steps per episode max st. D is used to sample experience
to update neural network parameters.

4. At the beginning of each episode, initialize the random exploration noise and generate the initial state ss_1.

5. For each step ι, the actor selects an action aaι based on the current policy and
random noise.

6. The environment executes action aa_ι, returns the reward rr_ι, and observes the new state ss_{ι+1}. Then the transition (ss_ι, aa_ι, rr_ι, ss_{ι+1}) is stored in D. At the same time, the actor sends the action to the critic network.

7. Randomly sample a batch of data (ssi , aai , rri , ssi+1 ) from D. Then calculate
the expected value/reward zi .

8. Update Critic and Actor evaluation network with the sampled data.

9. Update the Actor and Critic target networks with the rate ε (steps 7-9 are sketched in code after this list).

10. This process is done until it reaches the maximum number of episodes.
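The following condensed PyTorch sketch illustrates steps 7-9 (one sampled-batch update of the critic, actor, and target networks). The network sizes, state/action dimensions, and hyperparameter values here are assumptions for illustration, not the exact networks used in our experiments.

import torch
import torch.nn as nn

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 128), nn.ReLU(), nn.Linear(128, out))

s_dim, a_dim, gamma, eps = 8, 4, 0.9, 0.01           # assumed sizes and rates
actor, actor_t = mlp(s_dim, a_dim), mlp(s_dim, a_dim)
critic, critic_t = mlp(s_dim + a_dim, 1), mlp(s_dim + a_dim, 1)
actor_t.load_state_dict(actor.state_dict())          # targets start as copies
critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=2e-4)

def update(ss, aa, rr, ss_next):
    # step 7: target value z_i = rr_i + gamma * Q'(ss_{i+1}, mu'(ss_{i+1}))
    with torch.no_grad():
        z = rr + gamma * critic_t(torch.cat([ss_next, actor_t(ss_next)], dim=1))
    # step 8: update the critic evaluation network by regression toward z
    q = critic(torch.cat([ss, aa], dim=1))
    loss_c = nn.functional.mse_loss(q, z)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    # step 8: update the actor by ascending the critic's value (policy gradient)
    loss_a = -critic(torch.cat([ss, actor(ss)], dim=1)).mean()
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    # step 9: soft-update the target networks, theta' <- eps*theta + (1-eps)*theta'
    for net, tgt in ((actor, actor_t), (critic, critic_t)):
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1 - eps).add_(eps * p.data)

update(torch.randn(32, s_dim), torch.randn(32, a_dim),
       torch.randn(32, 1), torch.randn(32, s_dim))   # one update on a random batch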

3.4.3 RL Method across Two Timescales


While the RL technique can handle network dynamics, it is also flexible enough to deal with the complexity of multi-timescale scenarios. We now further extend our proposed DDPG method to work across two timescales. There are two different ways to do so. A straightforward way is to build another separate DDPG model for the task dispatching problem P2 and run both DDPG models at different timescales (the joint one for each time frame and the P2 one for each time slot). The other way is to use the same DDPG model but force the action policy not to adjust the resource placement during the fast timescale. Either way, the agent can still learn the best decision based on the environment and the current state vector.
In this paper, we adopt the first method, as shown in Fig. 3.5. We use two DDPG networks, one for resource placement (RP DDPG) and the other for task dispatching (TD DDPG). Resource placement (RP DDPG) is performed once per time frame, while task dispatching (TD DDPG) is executed every time slot. In each time slot, the environment sends the current network state (available network bandwidth and computing resources) to the task dispatching agent (TD agent), and the TD agent outputs the task dispatching decision to the environment. A minimal sketch of this two-agent loop follows.
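In the sketch below, the Agent class is a stand-in for a trained DDPG model, and χ = 5 is an assumed frame length; only the scheduling of the two agents is shown.

CHI = 5  # assumed number of time slots per time frame

class Agent:
    # stand-in for a trained DDPG model (RP DDPG or TD DDPG)
    def __init__(self, name):
        self.name = name
    def act(self, state):
        return f"{self.name}-action"

rp_agent, td_agent = Agent("RP"), Agent("TD")

def control_loop(num_frames, state=None):
    for _ in range(num_frames):
        placement = rp_agent.act(state)      # RP DDPG: once per time frame
        for _ in range(CHI):
            dispatch = td_agent.act(state)   # TD DDPG: every time slot
    return placement, dispatch

print(control_loop(2))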

Figure 3.5: Resource placement and task dispatching via deep reinforcement learning across two timescales with two DDPG models.

3.5 Evaluation
This section reports the results from our trace-based simulations to evaluate our
proposed strategies.

3.5.1 Simulation Setup


In our simulation, we randomly construct edge networks G with 10 to 50 edge
servers whose degree satisfies a binomial distribution. The propagation and band-
width for each network link are randomly generated. Each edge server has a limited
storage capacity ranging from 512 MB to 1,024 MB. To simulate the CPU, memory, and status of edge servers, we make use of the Google Cluster Data (ClusterData 2011 traces) [86]. For the external resources (data and services), we randomly generate 100 data items and 20 services, where the size of each resource ranges from 10 MB to 200 MB. To simulate the tasks from mobile users, we leverage the user mobility data from the CRAWDAD dataset kaist/wibro [70], developed by a Korean team, which collected the CBR and VoIP traffic from the WiBro network in Seoul, Korea. We randomly sample from this dataset to generate random tasks from mobile users in our simulation. We run our experiments on a DELL Precision 3630 Tower with an i7-9700 CPU, 16GB RAM, and an NVIDIA GeForce RTX 2060 GPU. For our proposed RL-based method, the hyperparameter configuration is reported in Table 3.1. The parameters are initialized with values commonly used in RL experiments; we tested multiple values for each parameter and selected the one with better performance.

Table 3.1: RL Hyper Parameters
Parameter              | Value   | Parameter               | Value
Max Episode            | 100     | Reward Discount         | 0.9
Max Step per Episode   | 3,000   | Batch Size              | 32
Learning Max Episode   | 10      | Soft Replacement        | 0.01
Actor Learning rate    | 0.0001  | Replay buffer Capacity  | 10,000
Critic Learning rate   | 0.0002  |                         |

We compare our proposed Two Stage Optimization
that has better performance. We compare our proposed Two Stage Optimization
(OPT) and Deep Reinforcement Learning (RL) solutions with two baselines: a
random strategy and a greedy strategy.

• Random (RAND). At each time slice, it randomly generates a feasible resource placement and task dispatching decision that satisfies the constraints.

• Greedy (GRD). It greedily determines the resource placement and task dispatching decisions to maximize total utility in each round, giving priority to resources/tasks based on their popularity/benefits. Specifically, GRD first sorts resources by popularity and processes the most popular one first; it iteratively selects, for each resource, the edge server that maximizes the total utility in that round. Similarly, for task dispatching, GRD sorts all tasks by their benefits and processes the most beneficial task first; likewise, it iteratively and greedily selects an edge server to dispatch each task to obtain the maximal task utility in each round.

Figure 3.6: Overall performance of four methods in one timescale: (a) number of task requests; (b) number of edge servers.

We evaluate the performance of all methods based on the average total utility (i.e., the objective function in our formulated optimization problems). Obviously, the larger the utility value, the better the resource placement and task dispatching performance. All parameters required to calculate the objective function (such as network topology, bandwidth, task requirements, server capacity, and download cost) are known to all methods as inputs at each time unit. For the RL methods, these are used to calculate the reward at each time unit.

3.5.2 Overall Performance


In the first set of simulations, we test all four methods within a fixed time period
(in the single timescale) over different numbers of tasks or edge servers.
Fig. 3.6(a) displays the performance of the four solutions under different numbers of task requests (from 10 to 50 in each time unit). The number of edge servers is fixed at 30. The average total utilities of all four solutions increase as the number of task requests increases. Our proposed two-stage optimization algorithm (OPT) and the Reinforcement Learning (RL) method outperform the other two algorithms (RAND and GRD) in all cases. In addition, when the number of requests is low (e.g., 10 or 20), the difference in average total utilities between OPT and RL is small; as the number of requests increases, the difference becomes larger. Hence, in a real scenario, we can select either OPT or RL when the number of requests is low, and prefer RL when the number of requests is large.
We then fix the number of tasks at 30 and investigate the impact of the number
of edge servers (changing from 10 to 50). As shown in Fig. 3.6(b), the average total

Figure 3.7: Running time and convergence of OPT: (a) running time; (b) total utility per slot.

utility of RAND increases in the beginning and then varies less as the number of edge servers increases. Among the other three solutions, OPT and GRD vary slightly as the number of servers increases, while RL remains stable throughout. Overall, the performance of most solutions is relatively stable, especially RL. In all cases, RL and OPT perform much better than GRD and RAND, which once again confirms the advantage of our two proposed methods.

3.5.3 Running Time and Convergence of OPT


We first investigate the running time and convergence of our proposed two-stage
optimization (OPT) method.
Fig. 3.7(a) shows the running time of OPT, GRD, and RAND at different time slots. The running time is defined as the duration for which the algorithm executes. We find that GRD and RAND have the least running time since their placement/dispatching can be done in polynomial time. Our OPT method spends more time solving the challenging optimization problem, but recall that it generates a much better solution (higher total utility) than GRD/RAND, as shown in Fig. 3.6.
Recall that our two-stage optimization algorithm (Algorithm 3) iteratively optimizes the objective value up to a maximum number of iterations. Fig. 3.7(b) displays the total task utility per slot under different iterations. It is clear that with more iterations the overall trend of performance increases, even though there is a drop in early iterations and some variation across iterations. Therefore, it is necessary to select an appropriate max iteration (max_itr) to achieve a decent performance (total utility). There is a trade-off between the max iteration and the running time as well as the optimization objective value, since more iterations cost more running time.

Figure 3.8: Performance of OPT across two timescales with dynamic status: (a) single vs two timescales; (b) dynamic vs static status.

3.5.4 OPT across Two Timescales with Dynamic Status


We further investigate our proposed methods across two timescales, where the joint resource placement and task dispatching decisions are made at different timescales (time frame vs time slot). Here we mainly focus on our two-stage optimization solution (OPT).
In the first set of experiments, we run our proposed OPT method against RAND/GRD in three different scenarios: (1) Slow Timescale: all methods perform joint resource placement and task dispatching at the beginning of each time frame; (2) Fast Timescale: all methods perform placement and dispatching at the beginning of each time slot; (3) Two Timescales: all methods perform task dispatching in each time slot, while joint resource placement and task dispatching are performed only at the beginning of each time frame. In this set of experiments, each time frame has 5 time slots (i.e., χ = 5), and we fix the number of requests per time frame at 30 and the number of edge servers at 30.
Fig. 3.8(a) displays the performance of the three methods (OPT, RAND, GRD) under the three scenarios. First, our proposed two-stage OPT method achieves better performance than RAND and GRD in all settings. Second, for all three solutions, running at the slow timescale achieves larger utilities than running at the fast timescale. This is mainly because running at a slow timescale takes advantage of better global information over a longer time duration. In addition, the fast timescale solution also suffers from frequent resource placement changes, which might be costly. Third, when the solutions are performed across two timescales, the performance can be further improved. This might be because performing task dispatching at each time slot can find a suitable server to perform the task and quickly release the server for other tasks. Overall, the results from this set show that a multi-timescale solution can achieve better performance compared with the single-timescale method, which echoes a similar discovery from [22, 132] (though the studied problems and network models are different).
Finally, we evaluate our proposed two-timescale solutions over edge servers with
dynamic status by leveraging the status trace-driven data from the Google Cluster
Data (ClusterData 2011 traces) [86]. We use the trace data to generate the server
status at different time slots. Other parameters are similar to previous experiments.
For two-timescale solutions, we use different combinations of OPT/GRD/RAND
to solve data placement and task dispatching problems respectively. As shown in
Fig. 3.8(b), there are nine combinations in total. For example, OPT+RAND means
the optimization-based method is used for data placement, while task dispatching is
done randomly. Fig. 3.8(b) reports the results of these methods under three different
scenarios: (1) Always On: all edge servers are assumed to be always running and available for serving tasks; (2) Dynamic Status: the status of each edge node varies with the time slot; when a server is down at a time slot, no task can be dispatched to it; (3) Static Status: our method completely ignores the server status when solving the data placement and task dispatching. Obviously, all combinations with dynamic status have lower total utility than those with always-on servers, since some servers may be unavailable in certain time slots. In addition, if the status is ignored, the performance (of static status) is significantly reduced, since dispatched tasks may not be completed due to servers being unavailable. Clearly, our solutions that consider dynamic status achieve performance comparable to the case where every server is on. Last, among all nine combinations, using our optimization-based solution for both resource placement and task dispatching across two timescales yields higher performance than the other combinations. This indirectly illustrates the effectiveness of the two-stage algorithm under two timescales in handling real dynamics in edge computing, which is the major contribution of this paper.

Figure 3.9: Convergence of RL under different timescales: (a) single timescale; (b) across two timescales.

3.5.5 Performance and Convergence of RL


In this subsection, we study the performance and convergence of our proposed
deep RL methods. The default number of edge servers is set to 10.
Convergence performance of RL under a single timescale and across two timescales: Fig. 3.9(a) displays the convergence of our RL solution that jointly determines the resource placement and task dispatching decisions in a single timescale. As we can see, the reward gets higher as the number of episodes increases, and it converges at around the 80th episode. On the other hand, Fig. 3.9(b) shows the convergence of our RL solution across two timescales, where the task dispatching decision is made at the fast timescale and the resource placement decision at the slow timescale. We find that the reward drops in the beginning and then increases as the training episodes increase. We also observe that the reward in Fig. 3.9(b) is higher than that in Fig. 3.9(a). This further confirms the benefit of making resource placement and task dispatching decisions across two timescales.
Convergence performance under different batch sizes/learning rates: Finally, we investigate the convergence of our proposed deep RL method with different batch sizes and learning rates. Fig. 3.10(a) shows the performance of RL with batch sizes of 32, 64, and 128. The batch size determines the number of experience samples trained at each step. We find that batch size 32 obtains higher rewards and converges earlier than the other two settings. Fig. 3.10(b) shows the performance of RL at different learning rates ε, which control the update speed of the weights in the neural networks. Here, we use different rates for the actor and critic (denoted by LC_A and LC_C, respectively). Obviously,

Figure 3.10: Convergence of RL under different batch sizes and learning rates: (a) different batch size; (b) different learning rate.

different learning rates lead to different convergence results, so we have to select an appropriate learning rate for our RL model.

3.6 Literature Review

3.6.1 Resource Placement/Management


In this paper, we consider both data placement and service placement as resource
placement in edge computing. Note that there are other types of resource management
problems in edge computing, such as virtual network function placement [90, 130],
virtual machine placement [94, 137], and cloudlet placement [140, 139, 129]. Next,
we briefly review existing works on data placement and service placement.
Data placement has been an important topic in distributed database/system
[17, 13], peer-to-peer networking[59, 100], content delivery network[19], and cloud
computing [109, 133]. While similar to all distributed systems, edge computing has its own characteristics [95], thus bringing new data placement problems. Shao et al. [93] proposed a data replica placement strategy for processing data-intensive IoT workflows in edge systems, which aims to minimize the data access cost while meeting the workflow's deadline constraint. The problem is modeled as a 0-1 integer programming problem and solved by intelligent swarm optimization. Similarly, Lin et al. [55] proposed a self-adaptive discrete particle swarm optimization algorithm to optimize the data transmission time when placing data for a scientific workflow in edge computing. Li et al. [45] investigated a joint optimization of data placement and task
scheduling in edge computing to reduce the computation delay and response time.
Their formulated optimization considers the value, transmission cost, and replacement
cost of data blocks, which is then solved by a tabu search algorithm. Breitbach et al.
[12] also studied both data placement and task placement in edge computing by considering multiple context dimensions. For the data placement part, the proposed data management scheme adopts context-aware replication, where the parameters of the replication strategy are tuned based on context information (such as data size, remaining storage, stability, and application). Huang et al. [30] studied caching fair-
ness for data sharing in edge computing environments. They formulate the caching
fairness problem, where fairness metrics take resources and wireless contention into
consideration, and propose both approximation and distributed algorithms. Xie et
al. [123] also studied the data-sharing problem and proposed a coordinate-based data
indexing mechanism to enable the efficient data sharing in edge computing. It maps
both switches and data indexes into a virtual space with associated coordinates, and
then the index servers are selected for each data based on the virtual coordinates.
Xie et al. [122] further extended their virtual-space method to handle data placement
and retrieval in edge computing with an enhancement based on centroidal Voronoi tessellation to handle load balancing among edge servers. Similarly, Wei et al. [118, 119]
proposed another virtual-space based data placement strategy which takes the data
popularity of data items into consideration during the virtual-space mapping, data
placement and retrieval. There are solutions [65] for data management issues in edge
computing as well.
Similar to data placement, service and resource placement in edge computing has
been studied as well. Ouyang et al. [78] proposed an adaptive user-managed ser-
vice placement algorithm to jointly optimize the latency and service migration cost.
By formulating the service placement problem as a contextual Multi-armed Bandit
problem, they proposed a Thompson-sampling based online learning algorithm to
explore make adaptive service placement decisions. Xu et al. [126] studied the ser-
vice caching in mobile edge clouds with multiple service providers completing for
both computation and bandwidth resources, and proposed a distributed and stable
game-theoretical caching mechanism for resource sharing among the network service
providers. Pasteris et al. [79] also studied a multiple-service placement problem in
a heterogeneous edge system and proposed an approximation algorithm for placing

multiple services to maximize the total reward. Meskar and Liang [67] proposed a
resource allocation rule retaining fairness properties among multiple access points,
while Zhang et al. [134] proposed a decentralized multi-provider resource allocation
scheme to maximize the overall benefit of all providers. Resource placement has also
been considered jointly with other design issues in edge networking and computing.
For example, Kim et al. [37] designed a joint optimization of wireless MIMO signal
design and network resource allocation to maximize energy efficiency in wireless D2D
edge computing. Eshraghi and Liang [20] considered the joint optimization of com-
puting/communication resource allocation and offloading decisions of uncertain tasks
in mobile edge networks.

3.6.2 Task Offloading/Dispatching


Task dispatching, also known as computation offloading [10], is another critical problem in edge computing and has been studied extensively. In many cases, it is jointly
considered with data/resource placement. For example, Breitbach et al. [12] also
considered task placement in their context-aware solution, where the task scheduler
allocates tasks according to the current context and observes the state during runtime.
Bi et al. [11] jointly studied a task offloading, service caching, and resource allocation
problem in a single edge server that assists a mobile user to perform a sequence of
computation tasks. They formulated it as a mixed integer nonlinear programming
(MINLP), and then solved it by separately optimizing the resource allocation and
transforming the problem to an integer linear program. Xu et al. [124] proposed an
online algorithm to jointly optimize dynamic service caching and task offloading in
edge-enabled dense cellular networks. Their solution is based on Lyapunov optimiza-
tion and Gibbs sampling without knowing future information. Similarly, Poularakis
et al. [82] investigated the joint service placement and request routing problem in
edge-enabled multi-cell networks and proposed a bi-criteria algorithm with a random-
ized rounding technique that achieves approximation guarantees while violating the
resource constraints in a bounded way. Ma et al. [61] studied cooperation among
edge servers and investigated cooperative service caching and workload scheduling in
a mobile edge computing environment. They formulated the problem as MINLP and
solved it by an iterative algorithm based on Gibbs sampling to achieve near-optimal

performance. Yang et al. [129] proposed a Benders decomposition-based algorithm
to jointly solve the cloudlet placement and task allocation problem while minimizing
the total energy consumption.
However, most of these works consider a kind of joint optimization at a single
timescale, and thus may not handle the dynamic among tasks, resources, and com-
putation facilities in the edge computing environment. Recently, Farhadi et al. [22]
studied service placement and request scheduling problems in edge cloud environ-
ments for data-intensive applications and proposed a two-timescales framework to
determine the near-optimal decision under specific constraints. You et al. [132] also
studied a joint resource provision and workload distribution problem in a mobile edge
network. They formulated the problem as a nonlinear mixed-integer program to min-
imize the long-term cost, and proposed online learning-based algorithms to solve the
problem in two timescales. Our work is inspired by these works, but we consider
different joint optimization with different network and edge settings. In addition, we
also leverage deep reinforcement learning to solve joint optimization.

3.6.3 Deep Reinforcement Learning


Reinforcement learning is one of the basic machine learning paradigms, which
has been well-studied and widely applied in many fields. Recent advances in deep
reinforcement learning (DRL) [97, 53, 69, 104, 68] have further enhanced its great ca-
pability to attack complex optimization problems in real dynamic systems, including
edge computing.
Chen et al. [15] have studied the computation offloading problem in a dynamic
time-varying network, and proposed a DQN-based solution to optimally offload the
computation to base stations to maximize the long-term utility performance. Li et al.
[46] considered the joint offloading and resource allocation in a multi-user edge system,
where multiple users can perform computation offloading via wireless channels to an
edge server. They proposed a DRL-based scheme to tackle the optimization. Huang
et al. [29] considered a binary task offloading in wireless edge system, and proposed a
DRL-based online offloading framework to adapt task offloading decisions and wire-
less resource allocations to the time-varying wireless channel conditions. Wang et al.
[106] also proposed a DRL-based resource allocation approach to adaptively allocate

computing and network resources to reduce the average service time and balance re-
source usages under a dynamic edge network. Ning et al. [75] solved the joint task
scheduling and resource allocation optimization in vehicular edge system to maximize
users’ Quality of Experience (QoE) by using a two-sided matching scheme for task
scheduling and a DRL approach for resource allocation respectively. Nath and Wu
[72] considered the computation offloading and resource allocation in a cache-assisted
edge system, and proposed a DDPG-based scheduling policy to minimize the long-
term average cost including energy consumption, total delays and resource accessing
cost. Meanwhile, Rahman et al. [84] also studied the joint problem of mode selec-
tion, resource allocation, and power allocation to minimize the total delay in the fog
radio access networks using DRL methods. While many of these works adopt DRL
to successfully optimize task scheduling/offloading and/or resource allocation, they usually use a single DRL agent to learn the dynamics. In our work, our DRL method has been extended to work across two timescales.

3.7 Chapter Summary


In this chapter, we have investigated a joint resource placement and task dispatch-
ing problem in edge computing across different timescales. We proposed a two-stage
optimization algorithm and a deep RL-based algorithm to solve this joint optimization
within a dynamic edge environment. Both methods can handle a variety of dynamics
at two different timescales. Our simulation results showed that (1) both proposed methods perform much better than the random and greedy algorithms; and (2) performing resource placement and task dispatching at different timescales not only reduces the placement cost but also requires little future prediction of tasks. The two proposed solutions have their own advantages. On one hand, RL needs more time to train the agent's model, while OPT directly solves the optimization problem. On the other hand, RL is more efficient at handling dynamic environments and scales well with a larger number of requests/servers.

CHAPTER 4

JOINT PARTICIPANT SELECTION AND


SCHEDULING IN FL

4.1 Introduction
Mobile users, Internet of Things (IoT) devices, and artificial intelligence applica-
tions generate massive amounts of data today, providing potential training datasets
for a variety of machine learning (ML) tasks. Traditionally, for centralized machine
learning model training, the entire dataset is uploaded to a remote cloud center. How-
ever, due to limited network bandwidth and data privacy concerns, uploading a large
amount of data to a remote data center is not trivial. Edge computing combined
with distributed machine learning is a natural alternative because training data is
generated at the network edge, such as from smart sensing devices and smartphones
connected to the network edge. Nonetheless, there are numerous challenges to train-
ing ML models in the edge cloud. First, due to limited data and computing resources,
a single edge device/server may be incapable of performing a high-quality ML model
training task on its own. Second, edge devices/servers’ computing capacity and net-
work resources are limited and heterogeneous. Different edge units may result in
varying convergence speeds and performances when performing ML training tasks.
Third, edge resources are typically shared by a large number of mobile users or ap-
plications. The shared resources and competition among various users, edge servers,
and applications must constrain distributed ML training within the edge cloud.
To tackle the aforementioned challenges, a new distributed machine learning
paradigm has been proposed, called federated learning (FL) [64, 34, 91] that con-
ducts distributed learning at multiple clients without sharing raw local data among
themselves. Coupled with edge computing, FL over edge cloud has been recently
studied in various settings [54, 56, 108, 76, 58, 35, 66, 113, 116, 74]. In such a sce-
nario, several edge servers have been selected as participants (either parameter servers
or FL workers), and collaboratively train a shared global ML model without sharing
their local dataset and decoupling the ability to do model training from the need to

Figure 4.1: Example of multi-model FL over the edge.

store data in a centralized server. More precisely, as shown in Fig. 4.1, in each global iteration, edge servers acting as workers first download the latest global model from the parameter server (PS) and then perform a fixed number of local training iterations on their local data. After that, the edge servers upload their local models to the parameter server, which is responsible for aggregating the parameters from different workers and sending the aggregated global model back to each FL worker. Prior efforts on FL over the edge have focused on convergence and adaptive control [108, 56], resource allocation and model aggregation [58, 113, 66], and communication and energy efficiency [64, 131, 48].
In this chapter, we focus on a joint participant selection and learning optimization problem in multi-model FL over a shared edge cloud¹. For each FL model, we aim to find one PS and multiple FL workers and to decide the local convergence rate for the FL workers. Note that both worker selection and learning rate control have been studied in FL recently. With heterogeneous resources and capacities at edge devices, when multiple FL models are trained at the same time, which FL model is preferentially served at which edge server directly affects the total communication and computational cost of the FL training. The selection of participants (both the PS and FL workers) for each model also affects the learning convergence speed. Hence, we aim to carefully select the FL participants for each FL model and pick the appropriate local learning rate for these selected FL workers, so as to minimize the total cost of FL training of all models while meeting the convergence requirement of each model.

¹As shown in Fig. 4.1, we consider an edge cloud architecture where a set of edge servers are connected to each other, without the remote cloud center, to form an edge network that serves the users.

4.2 System Model

4.2.1 Edge Cloud Model


We model the edge cloud as a graph G(V, E), consisting of N edge servers and L direct links among them, as shown in Fig. 4.1. Here V = {v_1, · · · , v_N} and E = {e_1, · · · , e_L} are the set of edge servers and the set of links, respectively. Each edge server v_i ∈ V has a storage capacity c^t_i and a CPU frequency f^t_i at time t. Each edge link e_j ∈ E has a network bandwidth b^t_j at time t. We omit t from the above notations when it is clear from the context.
Each edge server holds a certain distinct dataset collected from mobile devices/users, which can be used for local model training. We consider O types of datasets in the edge cloud and use z_{i,k} ∈ {0, 1} to indicate whether server v_i stores the kth type of dataset, and S_{i,k} to represent the raw sample data of the kth type stored at server v_i. Note that one edge server can hold multiple types of datasets. A small sketch of this model as Python types is given below.
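For illustration only, the edge cloud model can be written with Python data types as follows; the field names simply mirror the notation above and are not part of any implementation.

from dataclasses import dataclass, field

@dataclass
class EdgeServer:
    storage: float                   # c_i^t: storage capacity at time t
    cpu_freq: float                  # f_i^t: CPU frequency at time t
    datasets: dict = field(default_factory=dict)  # k -> S_{i,k}; z_{i,k}=1 iff k is a key

@dataclass
class EdgeLink:
    u: int                           # endpoint server indices
    v: int
    bandwidth: float                 # b_j^t: link bandwidth at time t

servers = [EdgeServer(1024, 2.4, {0: ["raw samples"]}), EdgeServer(512, 3.0)]
links = [EdgeLink(0, 1, 100.0)]      # a toy two-server edge cloud G(V, E)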

4.2.2 Federated Learning over Edge


We consider parallel federated learning where multiple machine learning models
are trained in parallel within the edge cloud. Compared with the classical FL scenario
in which the remote cloud works as the parameter server (PS), we select a group of
edge servers with enough capacity as the participants (one PS and multiple workers)
of FL for each model. We assume that W FL models (M = {m1 , · · · , mW }) need
to be trained at the same time. For the training task of each FL model mj , it
requests (1) κj + 1 edge servers as participants, one as its PS and κj as its workers,
whose CPU frequency should be larger than its required minimal CPU frequency χj ;
(2) the selected workers must have the requested types of a dataset for mj , where
wj,k ∈ {0, 1} indicates whether mj needs the kth type of dataset; and (3) the achieved
global convergence rate needs to be larger than ςj . Here, we assume that each model

Figure 4.2: The training process of an FL model within the edge network at different time periods.

uses a fixed number of workers, and one worker can only perform FL training of one
model at one time.
We consider a series of consecutive time periods t = 1, · · · , T , and each time
period has an equal duration τ . As shown in Fig. 4.2, at each time t, we select the FL
participants for each model and then train W models in parallel through FL, which
consists of a number of global iterations (let ϑtj be the number of global iterations
of mj at t). For each model mj , each global iteration includes four parts: (1) the
selected parameter server initializes the global model of mj ; (2) the selected workers
download the global model from the parameter server; (3) each worker runs the local
updates using its holding raw dataset for φtj local iterations to achieve the desired
local convergence rate ϱtj ; (4) workers upload the updated model and related gradient
to the parameter server for the aggregation to upload the global model. The process
of federated learning at different time periods is shown in Fig. 4.2.
Next, we define our local training and global aggregation process as well as the
loss function during the federated edge learning at each time period.
Loss Function: Let all types of sample data used by the jth model and stored in edge server v_i be defined by D^t_{j,i} = ∪_{w_{j,k} z_{i,k}=1} S_{i,k}. For each sample data d = <q_d, r_d> ∈ D^t_{j,i}, where q_d is the input data and r_d is the output data/label, we define the average loss of data for the jth FL model on server v_i in time period t as A^t_{j,i}(p):

A^t_{j,i}(p) = (1/|D^t_{j,i}|) Σ_{d∈D^t_{j,i}} H(I(q_d; p), r_d),

where H(·) is the loss function measuring the performance of the training model, I(·) is the training model, and p is the model parameter.
Then the average loss of data for the jth FL model on all related edge servers in time period t is defined as:

A^t_j(p) = Σ_i (|D^t_{j,i}| / |D^t_j|) · A^t_{j,i}(p),

where D^t_j is the union of all involved training samples of model j at time t.
Local Training on FL Workers: For each global iteration α ∈ [1, ϑ^t_j] of the jth FL model, the related edge server v_i (FL worker) performs the following local update process:

p^{t,α}_{j,i} = p^{t,α−1}_j + ω^{t,α}_{j,i},

where p^{t,α}_{j,i} is the local model parameter on edge server v_i in the current iteration and p^{t,α−1}_j is the aggregated model downloaded from the parameter server in the last iteration, with p^{t,0}_j = p^{t−1,ϑ^{t−1}_j}_j. ω^{t,α}_{j,i} is the local update from a gradient-based method, calculated as follows:

ω^{t,α}_{j,i} = Σ_{β=1}^{φ^t_j} ω^{t,α,β}_{j,i} = Σ_{β=1}^{φ^t_j} {ω^{t,α,β−1}_{j,i} − δ∇L^{t,α}_{j,i}(ω^{t,α,β−1}_{j,i})},

where ω^{t,α,β}_{j,i} is the model parameter of the jth FL model in the βth local update and δ is the step size of the local update. Lastly, L^{t,α}_{j,i}(·) is the predefined local update function. Based on [36], L^{t,α}_{j,i}(·) is defined as below:

L^{t,α}_{j,i}(ω) = A^t_{j,i}(p^{t,α−1}_j + ω) − {∇A^t_{j,i}(p^{t,α−1}_j) − ξ_1 J^t_j(p^{t,α−1}_j)}^⊤ ω + (ξ_2/2)·||ω||²,

J^t_j(p^{t,α}_j) = Σ_i ∇A^t_{j,i}(p^{t,α}_j) / Σ_i y^t_{i,j},

where ξ_1 and ξ_2 are two constant variables. J^t_j(·) is the sum of gradients among all related edge servers (normalized by the number of selected workers), and this process is performed in the global aggregation step.

Assume that A^t_{j,i}(·) is λ-Lipschitz continuous and γ-strongly convex [14, 131]; then the local convergence of the local model is represented as

L^{t,α}_{j,i}(ω^{t,φ^t_j}_{j,i}) − L^{t,∗}_{j,i} ≤ ϱ^t_j [L^{t,α}_{j,i}(ω^{t,0}_{j,i}) − L^{t,∗}_{j,i}],    (4.1)

where L^{t,∗}_{j,i} is the local optimum of the training model. Furthermore, we can set ω^{t,0}_{j,i} = 0 since the initial value of the training model can start from 0.
Global Aggregation on Parameter Server: After the local updates, all related FL workers upload their local model parameters ω^{t,α}_{j,i} and the related gradients ∇A^t_{j,i}(p^{t,α}_j) to the parameter server for aggregation:

p^{t,α}_j = p^{t,α−1}_j + Σ_i {y^t_{i,j} ω^{t,α}_{j,i}} / Σ_i y^t_{i,j}.

Then, the global average loss of data for the jth model is

G^t_j(p^{t,α}_j) = Σ_i (y^t_{i,j} |D^t_{j,i}| / |D^t_j|) · A^t_{j,i}(p^{t,α}_j).

Similarly, the global convergence of the global model is defined as

G^t_j(p^{t,ϑ^t_j}_j) − G^{t,∗}_j ≤ ς_j [G^t_j(p^{t,0}_j) − G^{t,∗}_j],    (4.2)

where G^{t,∗}_j is the global optimum of the training model.
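To make the training loop above concrete, the following numpy sketch runs a simplified version of the local-update/aggregation process: it uses plain gradient steps (ignoring the gradient-correction term ξ_1 J^t_j of the surrogate L^{t,α}_{j,i}) and uniform worker weights, so it is only an illustrative approximation of the scheme, with all names and the toy losses assumed.

import numpy as np

def local_update(p_global, grad_fn, steps, delta):
    # phi local gradient steps starting from omega = 0 (cf. the local update rule)
    omega = np.zeros_like(p_global)
    for _ in range(steps):
        omega -= delta * grad_fn(p_global + omega)
    return omega

def aggregate(p_global, omegas):
    # uniform-weight version of the PS aggregation of worker updates
    return p_global + np.mean(omegas, axis=0)

# toy instance: two workers whose local losses are quadratics with optima at +1 and -1
grads = [lambda p: p - 1.0, lambda p: p + 1.0]
p = np.zeros(3)
for _ in range(20):                      # global iterations (vartheta)
    omegas = [local_update(p, g, steps=5, delta=0.1) for g in grads]
    p = aggregate(p, omegas)
print(p)                                 # stays near the midpoint of the two optima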


Finally, from formulas (4.1) and (4.2), in order to achieve the desired local convergence rate ϱ^t_j and global convergence rate ς_j, we need to calculate the number of local updates φ^t_j and the number of global iterations ϑ^t_j. From the above observation, the global convergence rate ς_j of each FL model can be predefined, and we have to conduct enough local updates and global iterations to achieve it. We then have the following relationship between the convergence rates and the numbers of local updates and global iterations [36, 35]:

ϑ^t_j ≥ (2λ²/(γ²ξ_1)) · ln(1/ς_j) · 1/(1 − ϱ^t_j) ≜ ϑ_0 · ln(1/ς_j) · 1/(1 − ϱ^t_j),

φ^t_j ≥ (2/((2 − λδ)δγ)) · log_2(1/ϱ^t_j) ≜ φ_0 · log_2(1/ϱ^t_j),

where ξ_1 is the constant defined in the function L^{t,α}_{j,i}(·), λ is the Lipschitz parameter, and γ is the strong-convexity parameter. Both λ and γ are determined by the loss function. ϑ_0 and φ_0 are two constants, where ϑ_0 = 2λ²/(γ²ξ_1) and φ_0 = 2/((2 − λδ)δγ).
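A small helper, sketched below, turns these two bounds into concrete iteration counts; the default constants λ, γ, ξ_1, and δ are assumed placeholder values, since in practice they depend on the chosen loss function.

import math

def required_iterations(varsigma, varrho, lam=1.0, gamma=1.0, xi1=1.0, delta=0.5):
    # theta_0 and phi_0 from the bounds above; all constants are placeholders
    theta0 = 2 * lam ** 2 / (gamma ** 2 * xi1)
    phi0 = 2 / ((2 - lam * delta) * delta * gamma)
    theta = theta0 * math.log(1 / varsigma) / (1 - varrho)   # global iterations
    phi = phi0 * math.log2(1 / varrho)                       # local updates
    return math.ceil(theta), math.ceil(phi)

print(required_iterations(varsigma=0.1, varrho=0.5))         # e.g., (10, 3)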
4.3 Joint Participant Selection and Learning Optimization Problem

4.3.1 Problem Formulation


Under the previously introduced multi-model federated learning scenario, we con-
sider how to choose participants for each of the models and how to schedule their
local/global updates. Particularly, at each time period t, we need to make the fol-
lowing participant selection and learning scheduling decisions for each model m_j. We denote x^t_{i,j} and y^t_{i,j} as the decisions on whether to select edge server v_i as a parameter server or as an FL worker, respectively, for the jth FL model m_j at time t. Again, we assume that only one PS and κ_j workers are selected for each model, i.e., Σ_{i=1}^{N} x^t_{i,j} = 1 and Σ_{i=1}^{N} y^t_{i,j} = κ_j. We use ϱ^t_j ∈ [0, 1) to represent the maximal local convergence rate of m_j at time t. We will use ϱ^t_j and ς_j to control the numbers of global iterations and local updates for model m_j at time t. Recall that ς_j is given by the model m_j as a requirement; thus only ϱ^t_j is used for optimization. Overall, x^t_{i,j}, y^t_{i,j}, and ϱ^t_j are the decision variables of our optimization in each time period t.
We now formulate our participant selection problem in multi-model FL where we
need to select the parameter server and workers for each model as well as achieve the
desired local convergence rate. The objective of our problem is to minimize the total
cost of all FL models at time t under specific constraints.
min Σ_{j=1}^{W} ϖ^t_j    (4.3)

s.t.  x^t_{i,j} μ_j κ_j ≤ c^t_i,  y^t_{i,j} μ_j ≤ c^t_i,  ∀i, j    (4.4)
      x^t_{i,j} χ_j ≤ f^t_i,  y^t_{i,j} χ_j ≤ f^t_i,  ∀i, j    (4.5)
      w_{j,k} y^t_{i,j} z_{i,k} = 1,  ∀i, j, k    (4.6)
      Σ_i x^t_{i,j} = 1,  Σ_i y^t_{i,j} = κ_j,  ∀j    (4.7)
      Σ_j (x^t_{i,j} + y^t_{i,j}) ≤ 1,  ∀i    (4.8)
      x^t_{i,j} ∈ {0, 1},  y^t_{i,j} ∈ {0, 1},  ϱ^t_j ∈ [0, 1).    (4.9)

Here, ϖ^t_j is the total FL cost of the jth FL model at time t, which will be defined in the next subsection. Constraints (4.4) and (4.5) ensure that the storage and CPU satisfy the FL model requirements. Constraint (4.6) ensures that each selected edge server stores the dataset that matches the FL model. Constraint (4.7) guarantees that the numbers of parameter servers and FL workers of each model are 1 and κ_j, respectively. Constraint (4.8) ensures that each edge server trains at most one FL model and plays only one role at a time. The decision variables and their ranges are given in (4.9). With a nonlinear learning cost, this formulated optimization is a mixed-integer nonlinear program (MINLP), which is challenging to solve directly.

4.3.2 Cost Models


Our cost models consider four types of costs: global aggregation cost, local update cost, edge communication cost, and PS initialization cost, defined as follows.
Edge Communication Cost: The edge communication cost mainly consists of the FL model downloading and uploading costs. We denote by μ_j the uploaded and downloaded model size of the jth FL model m_j. When uploading the FL model to the parameter server or downloading the FL model from the parameter server, we use the shortest path in the edge cloud to calculate the communication cost. Let ρ_j(v_i, v_k) be the communication cost of model m_j from edge server v_i to v_k at time t; it can be calculated as ρ_j(v_i, v_k) = Σ_{e_l ∈ P^t_{i,k}} μ_j / b^t_l, where P^t_{i,k} is the shortest path connecting v_i to v_k at time t. For model m_j, the total edge communication cost is

C^{comm,t}_j = 2 · ϑ^t_j · Σ_{k=1}^{N} Σ_{i=1}^{N} x^t_{k,j} · y^t_{i,j} · ρ_j(v_i, v_k).

Here, v_i and v_k are a worker and the PS of m_j, respectively.


Local Update Cost: Let ψ(·) be the function defining the CPU cycles needed to process the sample data D^t_{j,i} used by the jth FL model and stored in edge server v_i. The total local update cost of the jth FL model in time period t is

C^{local,t}_j = ϑ^t_j · φ^t_j · Σ_{i=1}^{N} y^t_{i,j} · ψ(D^t_{j,i}) / f^t_i.
Global Aggregation Cost: Similarly, we use the ψ(·) function to define the CPU cycles needed to process the aggregation step for the uploaded FL model:

C^{global,t}_j = ϑ^t_j · Σ_{i=1}^{N} x^t_{i,j} · ψ(μ_j) / f^t_i.

Figure 4.3: The problem decomposition and design of our proposed multi-stage algorithms.

Initialization of Parameter Server: The parameter server needs to download the FL model assigned to it at time t unless it was the parameter server of the same FL model in the last time period. Let v^t_{ps}(m_j) be the PS selected for model m_j at time t. Then the initialization or switching cost of the parameter server can be calculated as

C^{init,t}_j =
  η_j,                                                  if t = 1 or v^{t−1}_{ps}(m_j) = NIL,
  0,                                                    if v^t_{ps}(m_j) = v^{t−1}_{ps}(m_j),
  min{η_j, ρ_j(v^{t−1}_{ps}(m_j), v^t_{ps}(m_j))},      otherwise.

If the FL model is trained for the first time or was not trained in the last time period, the selected parameter server has to download the model m_j with cost η_j. If the parameter server stays the same as in the last time period, there is no cost. Otherwise, the new parameter server needs to either download the model or transfer it from the previous server. Now, the total cost of the jth FL model at time t is given by

ϖ^t_j = C^{comm,t}_j + C^{local,t}_j + C^{global,t}_j + C^{init,t}_j.
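The sketch below assembles ϖ^t_j from the four cost terms for one model, assuming precomputed shortest-path costs ρ and toy parameter values; the function name and the instance are illustrative only, not our experimental code.

def total_fl_cost(theta, phi, x, y, rho, psi_data, psi_model, f, init_cost):
    n = len(f)
    comm = 2 * theta * sum(x[k] * y[i] * rho[i][k] for k in range(n) for i in range(n))
    local = theta * phi * sum(y[i] * psi_data[i] / f[i] for i in range(n))
    glob = theta * sum(x[i] * psi_model / f[i] for i in range(n))
    return comm + local + glob + init_cost    # C^comm + C^local + C^global + C^init

# toy instance: server 0 is the PS, servers 1 and 2 are the workers
x, y, f = [1, 0, 0], [0, 1, 1], [2.0, 2.0, 1.5]
rho = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]       # assumed path costs rho_j(v_i, v_k)
print(total_fl_cost(theta=4, phi=3, x=x, y=y, rho=rho,
                    psi_data=[0.0, 6.0, 4.5], psi_model=2.0, f=f, init_cost=1.0))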

4.4 Our Proposed Methods

4.4.1 Three-Stage Methods


Recall that the formulated problem in Section 4.3.1 is a mixed-integer nonlin-
ear program (MINLP), which is challenging to solve directly. Now, we decompose
our original problem into three sub-problems and attack it via multiple iterations of
solving the decomposed sub-problems, as shown in Fig. 4.3.

Three-Stage Decomposition

The main idea is based on a three-stage decomposition. In each stage, we focus on solving only one of the decision variables x^t_{i,j}, y^t_{i,j}, and ϱ^t_j while the other two are fixed. We iteratively repeat these three stages until a specific stopping condition is satisfied.
Stage 1: Parameter Server Selection. Given a worker selection and a local con-
vergence rate, we aim to find a parameter server for each model to minimize the total
cost, i.e.,
P1: min Σ_{j=1}^{W} ϖ^t_j    s.t. (4.4), (4.7), (4.9)    (4.10)
Stage 2: FL Worker Selection. We take the latest parameter server selection and
fixed local convergence rate to select FL workers of each model to minimize the total
cost, i.e.,
P2: min Σ_{j=1}^{W} ϖ^t_j    s.t. (4.5) − (4.9)    (4.11)
Stage 3: Local Convergence Rate Decision. With the latest PS and FL worker
selections, we can determine the optimal local convergence rate in order to minimize
the total cost.
P3: min Σ_{j=1}^{W} ϖ^t_j    s.t. (4.9)    (4.12)

Algorithm 4 Three-Stage Optimization Method
1: Initialize max_itr, max_occur, bound_val
2: Generate a random initial FL worker selection decision y^{t,0}_{i,j} and local convergence rate ϱ^{t,0}_j
3: ι = 1 and count_num = 0
4: repeat
5:    Stage 1: Calculate x^{t,ι}_{i,j} by solving P1 with fixed y^{t,ι−1}_{i,j} and ϱ^{t,ι−1}_j
6:    Stage 2: Calculate y^{t,ι}_{i,j} by solving P2 with fixed x^{t,ι}_{i,j} and ϱ^{t,ι−1}_j
7:    Stage 3: Calculate ϱ^{t,ι}_j by solving P3 with fixed x^{t,ι}_{i,j} and y^{t,ι}_{i,j}; let obj_val be the achieved objective value (total learning cost of all FL models)
8:    if obj_val < bound_val then
9:        bound_val = obj_val; count_num = 1
10:       x^t_{i,j} = x^{t,ι}_{i,j}; y^t_{i,j} = y^{t,ι}_{i,j}; ϱ^t_j = ϱ^{t,ι}_j
11:   else if obj_val = bound_val then
12:       count_num = count_num + 1
13:   ι = ι + 1
14: until count_num = max_occur or ι = max_itr
15: return x^t_{i,j}, y^t_{i,j} and ϱ^t_j

Three-Stage Methods

After decomposing the original problem into three sub-problems, we can solve each sub-problem either with linear programming techniques or with greedy heuristics. The basic idea shared by these methods is as follows. First, we randomly generate an FL worker selection decision y^{t,0}_{i,j} and a local convergence rate ϱ^{t,0}_j, then solve the optimization problem P1 to get the parameter server selection decision x^{t,1}_{i,j}. Next, given the local convergence rate ϱ^{t,0}_j and the latest parameter server selection decision x^{t,1}_{i,j}, we solve P2 to get the FL worker selection decision y^{t,1}_{i,j}. Last, based on the latest x^{t,1}_{i,j} and y^{t,1}_{i,j}, we solve P3 to obtain the desired local convergence rate ϱ^{t,1}_j. This process is repeated until a specific condition is satisfied (either no further improvement of the objective value of the optimization or reaching the maximal iteration number). Algorithm 4 shows the three-stage optimization method using the linear programming technique with the PuLP solver [99].
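As an illustration of how a single stage can be handed to PuLP, the sketch below solves a toy PS-selection instance with precomputed (assumed) costs and only constraints (4.7) and (4.8); the full P1 additionally carries the capacity constraints (4.4).

import pulp

N, W = 4, 2                                   # toy numbers of servers and models
cost = [[3, 5], [2, 4], [6, 1], [4, 2]]       # assumed PS cost of model j on server i
prob = pulp.LpProblem("stage1_ps_selection", pulp.LpMinimize)
x = [[pulp.LpVariable(f"x_{i}_{j}", cat="Binary") for j in range(W)] for i in range(N)]
prob += pulp.lpSum(cost[i][j] * x[i][j] for i in range(N) for j in range(W))
for j in range(W):                            # exactly one PS per model, cf. (4.7)
    prob += pulp.lpSum(x[i][j] for i in range(N)) == 1
for i in range(N):                            # one role per server, cf. (4.8)
    prob += pulp.lpSum(x[i][j] for j in range(W)) <= 1
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([[int(x[i][j].value()) for j in range(W)] for i in range(N)])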

Algorithm 5 Three-Stage Greedy Method
1: Initialize max_itr, max_occur, bound_val
2: Generate a random initial FL worker selection decision y^{t,0}_{i,j} and local convergence rate ϱ^{t,0}_j
3: ι = 1 and count_num = 0
4: repeat
5:    Stage 1: Pick the PS x^{t,ι}_{i,j} for each FL model with minimal total cost, with fixed y^{t,ι−1}_{i,j} and ϱ^{t,ι−1}_j
6:    Stage 2: Calculate the total cost of each potential edge server for each FL model, sort the list in ascending order, and greedily select the first κ_j edge servers to get y^{t,ι}_{i,j}, with the latest x^{t,ι}_{i,j} and fixed ϱ^{t,ι−1}_j
7:    Stage 3: Calculate ϱ^{t,ι}_j by greedily decreasing the local convergence rate to get a minimal total cost, with the latest x^{t,ι}_{i,j} and y^{t,ι}_{i,j}; let obj_val be the achieved objective value (total learning cost of all FL models)
8:    if obj_val < bound_val then
9:        bound_val = obj_val; count_num = 1
10:       x^t_{i,j} = x^{t,ι}_{i,j}; y^t_{i,j} = y^{t,ι}_{i,j}; ϱ^t_j = ϱ^{t,ι}_j
11:   else if obj_val = bound_val then
12:       count_num = count_num + 1
13:   ι = ι + 1
14: until count_num = max_occur or ι = max_itr
15: return x^t_{i,j}, y^t_{i,j} and ϱ^t_j

Algorithm 5 shows a three-stage greedy algorithm in which greedy heuristic methods are used to solve the three sub-problems. (a) For Stage 1, given a fixed worker selection and local convergence rate, we simply select the parameter server for each model with minimal cost. (b) For Stage 2, we calculate the total cost of each potential edge server for each FL model and then sort the edge server list in ascending order of cost. We greedily select the top κ_j edge servers as FL workers for each model. (c) In the last stage, we greedily decrease the local convergence rate by a specific step to obtain the minimal total cost until it satisfies the global convergence rate. We repeat the above steps until the ending condition is met, as sketched below.
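The following sketch illustrates the greedy heuristics of Stages 2 and 3 under simplifying assumptions; cost_fn and cost_of_rate are hypothetical stand-ins for the full learning-cost evaluation, not functions from our implementation.

```python
# Sketches of the greedy heuristics in Algorithm 5; cost_fn and cost_of_rate
# are hypothetical stand-ins for the full learning-cost evaluation.

def greedy_workers(servers, model, kappa_j, cost_fn, taken):
    """Stage 2: pick the kappa_j cheapest eligible servers for one FL model."""
    candidates = [(cost_fn(s, model), s) for s in servers if s not in taken]
    candidates.sort(key=lambda c: c[0])        # ascending total cost
    chosen = [s for _, s in candidates[:kappa_j]]
    taken.update(chosen)                       # each server serves one model
    return chosen

def greedy_rate(cost_of_rate, rho_max=0.99, rho_min=0.01, step=0.01):
    """Stage 3: decrease the local convergence rate by a fixed step and keep
    the rate achieving the minimal total cost."""
    best_rho, best_cost = rho_max, cost_of_rate(rho_max)
    rho = rho_max - step
    while rho >= rho_min:
        cost = cost_of_rate(rho)
        if cost < best_cost:
            best_rho, best_cost = rho, cost
        rho -= step
    return best_rho, best_cost
```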

Note that during the first two stages of Algorithm 5, we need to select the PS or workers for all models in a certain order. Obviously, the processing order of the models may affect the final performance. By default, we simply process them in a first-come-first-served mode, i.e., we first find the solution for the model that arrives earlier. Due to the heterogeneity of edge servers in a real edge cloud, some edge servers may have abundant resources (storage and computing capacity) while others do not. In such a resource-limited scenario, serving the more complex FL model first may reduce the total completion cost of all models. Therefore, we also introduce a greedy variation in which the FL models are sorted by model size and the larger model is processed first in both the first and second stages of Algorithm 5. In this variation, the more complex FL model gets the first chance to select more high-performance workers, leading to a lower total cost. In our experiments, we evaluate the impact of these two processing orders. In addition, other ordering methods can also be applied to our proposed method, such as choosing the model that requests more resources first.

4.4.2 Two-Stage Methods

We can also combine the first two stages since both involve integer variables. Then the optimization can be solved via a two-stage decomposition. Here, we separate the integer variables (x_{i,j}^t, y_{i,j}^t) and the continuous variable ϱ_j^t into two sub-problems, as shown in Fig. 4.3.

Stage 1: Parameter Server and Worker Selection. Given the last local convergence rate, we want to find an optimal decision for selecting the parameter server and workers, i.e.,

$$\mathbf{P4}: \min \sum_{j=1}^{W} \varpi_j^t \quad \text{s.t.} \; (4.4)-(4.9) \qquad (4.13)$$

Stage 2: Local Convergence Rate Decision. This is the same as the third sub-problem P3 in the three-stage methods.

Here, we use an optimization solver (GEKKO [9]) to solve the sub-problem P4 since it is a non-linear problem with two sets of integer variables. For P3, we still use the PuLP solver. The details of the two-stage method are given in Algorithm 6.
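As a flavor of how such a mixed-integer non-linear stage can be handed to GEKKO, the toy sketch below declares binary selection variables inside a non-linear objective and invokes the APOPT mixed-integer solver. The objective and constraint are illustrative only and do not reproduce our actual cost terms.

```python
# A minimal GEKKO sketch in the spirit of P4: binary decision variables in a
# non-linear objective, solved with APOPT. All numbers are toy placeholders.
from gekko import GEKKO

m = GEKKO(remote=False)
x = [m.Var(lb=0, ub=1, integer=True) for _ in range(4)]   # selection choices
rho = 0.5                                                  # fixed local rate

# Toy non-linear coupling between selections, standing in for the cost terms.
m.Minimize(sum((i + 1) * xi for i, xi in enumerate(x))
           + 3 * x[0] * x[1] / (1 - rho))
m.Equation(sum(x) == 2)        # e.g., exactly two participants must be chosen

m.options.SOLVER = 1           # APOPT handles mixed-integer problems
m.solve(disp=False)
print([xi.value[0] for xi in x])
```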

Algorithm 6 Two-Stage Optimization Method
1: Initialize max_itr, max_occur, bound_val
2: Generate a random initial local convergence rate ϱ_j^{t,0}
3: ι = 1 and count_num = 0
4: repeat
5:    Stage 1: Calculate x_{i,j}^{t,ι} and y_{i,j}^{t,ι} by solving P4 with fixed ϱ_j^{t,ι-1}
6:    Stage 2: Calculate ϱ_j^{t,ι} by solving P3 with the latest x_{i,j}^{t,ι} and y_{i,j}^{t,ι}. Let obj_val be the achieved objective value (total learning cost of all FL models)
7:    if obj_val < bound_val then
8:        bound_val = obj_val; count_num = 1
9:        x_{i,j}^{t} = x_{i,j}^{t,ι}; y_{i,j}^{t} = y_{i,j}^{t,ι}; ϱ_j^{t} = ϱ_j^{t,ι}
10:   else if obj_val = bound_val then
11:       count_num = count_num + 1
12:   ι = ι + 1
13: until count_num = max_occur or ι = max_itr
14: return x_{i,j}^{t}, y_{i,j}^{t}, and ϱ_j^{t}

4.4.3 Time Complexity

We now analyze the time complexity of each of the proposed algorithms. Here, we assume that the times taken to solve P1, P2, P3, and P4 with N servers and W models are T_1(N, W), T_2(N, W), T_3(N, W), and T_4(N, W), respectively. Given a decision of x_{i,j}^t, y_{i,j}^t, and ϱ_j^t, we can calculate the total learning cost in T_cost(N, W). Let ϵ be the step length for reducing the convergence rate in Stage 3 of Algorithm 5. Then it is easy to prove the following theorem regarding the time complexity of all proposed algorithms.

Theorem 1 The time complexities of Algorithms 4, 5, and 6 are bounded by O((T_1 + T_2 + T_3) · max_itr), O((N + 1/ϵ) · T_cost · W · max_itr), and O((T_4 + T_3) · max_itr), respectively.

Note that in Algorithm 5, the time complexities of Stage 1 and Stage 2 are bounded by O(N · T_cost · W) and O((1/ϵ) · T_cost · W), respectively.

Figure 4.4: An example of edge cloud topology.

4.5 Performance Evaluation


In this section, we present our experimental setup and evaluate the performance
of our proposed methods via simulations.

4.5.1 Environment Setup


Edge Cloud: In our edge computing environment, we adopt different random topologies consisting of 20 ∼ 40 edge servers, where the distribution of servers is based on the real-world EUA-Dataset [41]. This dataset is widely used in edge computing and contains the geographical locations of 125 cellular base stations in the Melbourne central business district. Fig. 4.4 illustrates one example topology used in our simulations. In each simulation, a certain number of edge servers are randomly selected from the dataset. Each edge server has a maximal storage capacity c_i, CPU frequency f_i, and link bandwidth b_i in the ranges 512 ∼ 1,024 GB, 2 ∼ 5 GHz, and 512 ∼ 1,024 Mbps, respectively. We consider O = 5 different data types (e.g., image, audio, and text), where the size S_{i,k} is in the range 1 ∼ 3 GB. Each type of data is distributed across different edge servers, and one edge server may store more than one type of data. Furthermore, the total number of time periods T is set to 30.
Federated Learning Models: To verify the performance of the federated learn-
ing process, we conduct a set of federated learning experiments. We assume that there
are W different FL tasks (vision, audio, text, or data) running in our environment
simultaneously. The number of FL workers κj required by each model is in the range

Table 4.1: Parameters Setting for Edge Cloud and FL

Edge Cloud Parameter                         Value or Range
  # of edge servers N                        20 ∼ 40
  v_i's storage capacity c_i                 512 ∼ 1,024 GB
  v_i's CPU frequency f_i                    2 ∼ 5 GHz
  e_i's link bandwidth b_i                   512 ∼ 1,024 Mbps
  # of different datasets O                  5
  each dataset size |S_{i,k}|                1 ∼ 3 GB
  # of time periods T                        30

Federated Learning Parameter                 Value or Range
  # of FL models W                           1 ∼ 5
  # of m_j's FL workers κ_j                  1 ∼ 7
  m_j's model size µ_j                       10 ∼ 100 MB
  m_j's CPU requirement χ_j                  1 ∼ 3 GHz
  m_j's downloading cost η_j                 1 ∼ 5
  m_j's global convergence reqs. ς_j         0.001 ∼ 0.1
  constant FL variables ϑ_0 and φ_0          15, 4

3 ∼ 7. Each FL task has a specific model size µ_j, CPU requirement χ_j, and downloading cost η_j in the ranges 10 ∼ 100 MB, 1 ∼ 3 GHz, and 1 ∼ 5, respectively. The global convergence requirement and the two constant variables are set based on [35]: ς_j = 0.001, ϑ_0 = 15, and φ_0 = 4. Three classical datasets in scikit-learn 1.0.2 [81] are used to train linear regression (LR) models: the California Housing dataset, the Diabetes dataset, and randomly generated LR datasets. Each LR model is trained with the Mean Squared Error (MSE) loss. In addition, we are interested in the performance of the proposed methods with non-convex loss functions. Hence, three different types of datasets are used for these FL tasks: Fashion-MNIST (FMNIST) [120], Speech Commands [114], and AG NEWS [138]. Each of them is trained with a CNN model.

We assign random data samples of these three datasets to clients in such a way that each client has a different number of training and testing samples. The Python library PyTorch (v1.10) is used to build the models. All experiments are run on a Linux workstation with 16 CPU cores, 512 GB of RAM, and 4 NVIDIA Tesla V100 GPUs interconnected with NVLink2. Detailed parameters of both the edge cloud and the FL models are listed in Table 4.1.
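The unequal (non-IID) assignment of samples to clients can be produced, for example, with a Dirichlet-based split as in the sketch below. This is one common recipe and an illustrative assumption, not the exact script used in our experiments.

```python
# A sketch of an unequal (non-IID) data split: each client receives a
# random, differently sized shard of the dataset.
import numpy as np

def unequal_split(num_samples, num_clients, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    counts = (proportions * num_samples).astype(int)
    counts[-1] = num_samples - counts[:-1].sum()   # assign the remainder
    idx = rng.permutation(num_samples)
    bounds = np.cumsum(counts)[:-1]
    return np.split(idx, bounds)                   # list of index arrays

shards = unequal_split(60_000, 10)                 # e.g., 10 edge servers
print([len(s) for s in shards])
```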

Baselines and Metrics: We compare our proposed algorithms (three-stage optimization (THSO), three-stage greedy (GRDY), and two-stage optimization (TWSO)) with four competitive methods:

• ROUND [35]: It selects the FL workers and the local convergence rate for each model based on a randomized rounding method [35]. Since it does not consider the PS selection, we use a random choice for the PS at the beginning.

• RAND: It randomly generates the parameter server selection, the FL worker selection decision, and the local convergence rate under certain constraints.

• DATA [16]: It selects the FL workers based on the fraction of data at the servers and prefers the servers with more data. Since it ignores the PS and local convergence rate selection, we determine them randomly.

• LOCAL [51]: It selects the top workers that are expected to complete the local training first (based on estimation). Again, random decisions are used for the PS and local rate.

4.5.2 Evaluation Results


Via extensive simulations, we evaluate the performance of our methods, focusing mainly on their cost performance.

Performance Comparison - Total Learning Cost

We first investigate different algorithms with a different number of edge servers


and global convergence rates. We consider 3 FL models for three different types of
tasks (i.e., image classification, speech recognition, text classification) to be trained
simultaneously, where each FL model has requested 5 FL workers. Fig. 4.5(a) and (b)
show the results of two groups of simulations. In the first group, we set the number
of edge servers to be from 20 to 40, while fixing the global convergence rate at 0.001.
In the second group, we change the global convergence rate from 0.001 to 0.1 with 30
edge servers. We have the following observations.
Figure 4.5: Performance comparison with different metrics: (a) number of servers, (b) global convergence rate, (c) detailed costs, (d) single vs. multiple models.

First, clearly for both sets of simulations, our three proposed algorithms (TWSO, THSO, GRDY) have better performance than the other four benchmarks in terms of the average total learning cost. Better performance than ROUND (which focuses on worker selection and learning rate optimization) confirms the advantage of our method in considering PS selection in the joint optimization. The better performance of both our methods and ROUND over DATA and LOCAL (which only focus on worker selection) shows the advantage of joint optimization. In all simulations, RAND has the worst performance since it does not perform any optimization.
Second, as shown in Fig. 4.5(a), the average total cost of every algorithm first decreases and then increases again as the number of edge servers grows. Initially, more edge servers provide better chances to find a good solution that minimizes the total cost of all FL models. On the other hand, an even larger topology with more servers may begin to increase the average total cost due to larger transmission costs from workers to the PS.
Third, as shown in Fig. 4.5(b), as the global convergence rate increases, the average total cost decreases. This is reasonable since a larger global convergence rate requires less local training and fewer global updates, which leads to a lower total learning cost. Fig. 4.5(c) also plots the detailed costs of the different methods when 30 edge servers are considered and the global convergence rate is 0.001. It shows that the local cost dominates the total cost; consequently, GRDY has a higher total cost than TWSO and THSO, as seen in Fig. 4.5(a).

Figure 4.6: Impact of the number of FL models on costs: (a) average total cost, (b) communication cost, (c) local update cost, (d) global update cost.

We also evaluate the effect of joint optimization over multiple models compared with separate optimization of a single model. In the latter case, we still use TWSO and THSO but restrict them to a single FL model at a time, thus sequentially choosing the decision for each model. Again, we train 3 FL models when 30 edge servers are considered and the global convergence rate is 0.001. Fig. 4.5(d) compares determining the choices for the three FL models jointly versus sequentially with TWSO and THSO. We can clearly see a lower total cost when we jointly optimize the decisions. This confirms the effectiveness of jointly determining the selection decision for multiple FL models rather than sequentially determining the decision for each model.

Figure 4.7: Impact of the number of FL workers on costs: (a) average total cost, (b) communication cost, (c) local update cost, (d) global update cost.

Impact of FL Model Number

Next, we look into the impact of different numbers of FL models. We simultaneously run 1 to 5 FL models. The number of edge servers and the number of FL workers are set to 30 and 5, respectively. The global convergence rate is also set to 0.001. As shown in Fig. 4.6(a), the more FL models, the higher the average total learning cost. Our proposed algorithms still perform better than the other four methods, and TWSO and THSO still enjoy a slight performance improvement over GRDY. We also plot the details of the three types of costs (i.e., communication cost, local cost, and global cost) in Fig. 4.6(b), (c), and (d). We can observe that the communication cost of GRDY is similar to those of TWSO and THSO. However, GRDY has the highest local cost and the lowest global cost compared with the other algorithms. This is because GRDY greedily runs more local training so that the number of global updates can be reduced while satisfying the expected global convergence rate.

Figure 4.8: Comparison of two different processing orders of FL models.

Impact of FL Worker Number

We further investigate the impact of different numbers of FL workers. In this simulation, we consider 30 edge servers and train 3 FL models while the global convergence rate is again 0.001. Results are reported in Fig. 4.7 and are similar to those with different numbers of FL models. First, the average total cost of all algorithms increases as the number of FL workers increases, since more FL workers consume more resources. Second, the proposed algorithms have better performance than the ROUND, RAND, DATA, and LOCAL algorithms, as shown in Fig. 4.7(a). Last, GRDY has the highest local cost while having lower communication and global costs compared with the other strategies, as shown in Fig. 4.7(b)-(d).

Impact of Model Processing Order in GRDY

Recall that in GRDY (Algorithm 5) we need to select the PS and workers for each model following a certain processing order among the FL models. We now study the impact of different processing orders in GRDY. We test two specific processing orders: the default first-come-first-served order (GRDY) and the variation in which priority is given to the model with a larger size (GRDY-Max). The experiments run on an edge cloud with 30 edge servers that have limited resources and significant differences. We run 20 different cases for each number of max iterations, and Fig. 4.8 shows the results. First, as the number of max iterations increases, the total cost of both greedy algorithms decreases since they have more chances to find a better solution with a lower cost. However, the improvement becomes smaller when the max iteration further increases. Second, under the resource-limited scenario, GRDY-Max performs better than GRDY in almost all cases. This result confirms the necessity and superiority of selecting an appropriate processing order in different edge scenarios. In addition, we need to select an appropriate max iteration to control the convergence speed of our greedy algorithms.

Figure 4.9: Training loss with LR models/tasks and the impact of FL workers: (a) R2 score of 3 LR models, (b) loss over California Housing, (c) loss over Diabetes, (d) loss over the random dataset.

FL Training Loss and Accuracy

Fig. 4.9 shows the training loss of our method in real-world federated learning experiments over the LR datasets. We introduce the R2 score metric to evaluate the performance of LR model (convex) training; the R2 score is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In this set of experiments, we concurrently train 3 LR models with 3 different datasets. Each dataset is split unequally across 10 edge servers (i.e., a non-IID setting), and the number of global training rounds is 100. We can see from Fig. 4.9(b)-(d) that the training loss decreases as the number of workers (κ_j) increases for each model. Fig. 4.9(a) shows the R2 score of all LR models. Obviously, with more workers, the R2 score of all models increases, which means all models are well-regressed. However, model 2 has a worse R2 score (a negative value) with fewer workers due to the small size of its training dataset. As the number of workers increases, the performance of model 2 becomes better.

Figure 4.10: Training accuracy with three FL tasks and the impact of FL workers: (a) accuracy of 3 FL models, (b) accuracy on FMNIST, (c) accuracy on Speech Commands, (d) accuracy on AG News.
Fig. 4.10 reports the learning accuracy of our method on more complex FL tasks with different numbers of workers (due to space limitations, we only show the results from THSO). Here, the datasets of the three FL models (image classification, speech recognition, text classification) are split into 30 partitions, and the number of global updates is set to 300. Fig. 4.10(a) shows that the training accuracy of all three FL models increases with the number of iterations. Fig. 4.10(b)-(d) show the detailed training accuracy of the three models with different numbers of FL workers. We can observe that with more FL workers, the training accuracy of all models can reach a higher value. However, comparing with the results in Fig. 4.7, more FL workers also consume a higher total cost. Hence, there is a trade-off between the training accuracy and the total cost. Another interesting observation is that for FMNIST and Speech Commands the accuracy increases with more FL workers, but for AG News the accuracy is similar or the difference is very minimal. This may be due to the simplicity of the AG News learning task. In summary, one needs to consider the trade-off between the training accuracy and the total cost: more FL workers incur a higher total cost but yield higher training accuracy, and vice versa.

4.6 Literature Review


In this section, we briefly review recent studies on federated learning over edge systems [54]. Federated learning [56, 108, 76, 58, 35, 66, 113, 116, 74] has emerged as a new distributed ML paradigm over different edge systems. Current FL frame-
works can be categorized into three types based on the learning topology used for
model aggregation: centralized FL (CFL), hierarchical FL (HFL), and decentralized
FL (DFL). CFL is the classical FL [35] where the parameter server (PS) and several
workers form a star architecture as shown in Fig. 4.1. Wang et al. [108] analyzed the
convergence of CFL in a constrained edge computing system and proposed a control
algorithm that determines the best trade-off between local update and global param-
eter aggregation to minimize the loss function. This work focused on the convergence
and adaptive control of FL and did not consider participant selection.
Nishio and Yonetani [76] studied a client selection problem in CFL in mobile edge
computing. Their method used an edge server in the cellular network as the PS and
selected a set of mobile clients as workers. Their client selection aimed to maximize the number of selected workers while meeting the time constraints.
joint control of FL and edge provisioning problem in distributed cloud-edge networks
where the cloud server is the PS and active edge servers are workers. Their method
controlled the status of edge servers for training to minimize the long-term cumulative
cost of FL and also satisfied the convergence of the trained model. Li et al. [52] also
considered client scheduling in FL to overcome client uncertainties or stragglers via
learning-based task replication. While these works are similar to ours, they focus
on the optimization of one global FL model instead of multiple FL models. More
importantly, these works do not consider PS selection for multiple FL models.

Recently, Nguyen et al.[74] studied resource sharing among multiple FL ser-
vices/models in edge computing where the user equipment is used as an FL worker,
and proposed a solution to optimally manage the resource allocation and learning
parameter control while ensuring the energy consumption requirement. However, their FL framework is different from ours. First, they use user equipment as FL workers, while we use edge servers. Second, they do not consider the PS selection, since they use a single edge server as the PS. Third, their model allows training multiple FL models on the same user equipment (while we do not allow an edge server to act as a worker for multiple models in the same time unit), and thus their method has to manage the CPU and bandwidth allocation on the equipment.
Both [56] and [58] considered a client-edge-cloud hierarchical federated learning
(HFL) where cloud and edge servers work as two-tier parameter servers to aggregate
the partial models from mobile clients (i.e. FL workers). Liu et al. [56] proved the
convergence of such an HFL, while Luo et al. [58] also studied a joint resource alloca-
tion and edge association problem for device users under such an HFL framework to
achieve global cost minimization. Wang et al. [113] considered the cluster structure
formation in HFL where edge servers are clustered for model aggregation. Recently,
Wei et al. [116] also studied the participant selection for HFL in edge clouds to
minimize the learning cost. However, our FL framework does not use HFL.
Meng et al. [66] focused on model training of DFL using decentralized P2P meth-
ods in edge computing. While their method also selected FL workers from an edge
network, the model aggregation was performed at edge devices based on a dynamically formed P2P topology (no PS). Therefore, it is different from our studied problem, which mainly focuses on CFL.
There are also other works [101, 131, 48] where energy efficiency and/or wireless communication have been taken into account for FL in edge systems.

4.7 Chapter Summary


In this chapter, we mainly focus on multi-model FL over an edge cloud and care-
fully select participants (both PS and workers) for each model by considering the
resource limitation and heterogeneity of edge servers as well as different data distri-
butions. We formulate a joint participant selection and learning optimization problem

to minimize the total FL cost of multiple models while ensuring their convergence
performance. We propose three different algorithms that decompose the original problem into multiple stages so that each stage can be solved by an optimization solver or
a greedy algorithm. Extensive simulations with real FL experiments show that our
proposed algorithms outperform similar existing solutions.

CHAPTER 5

QUANTUM-ASSISTED SCHEDULING
ALGORITHMS

5.1 Introduction
With the advancement of technology, quantum computing (QC) has gained much attention due to the realization of speedups offered by quantum techniques for complex computational problems. This has resulted in transformational breakthroughs on specific tasks accomplished with near-term quantum computers. QC has more computational power than classical computers and may be faster at solving complex optimization problems, e.g., random quantum circuit sampling [6], Gaussian boson sampling [143], and combinatorial optimization [77, 89, 5]. In this chapter, by leveraging the parallel computing capability of QC, we focus on designing a new quantum-inspired scheduling algorithm to solve a complex joint participant selection and learning scheduling problem for federated learning (FL) in distributed networks.
FL is a distributed artificial intelligence (AI) approach that allows for the training
of high-quality AI models by aggregating local updates from multiple FL clients (or
workers), such as IoT devices, without direct access to the local data [64, 33, 91, 116].
This potentially prevents the disclosure of sensitive user information and preferences,
reducing the risk of privacy leakage. Nevertheless, when deploying the FL framework
in distributed networks, there are two challenges. First, the computing power and
network resources of servers, as well as their data distribution, are diverse. Some
low-performance servers may cause the convergence process to slow down and reduce
training performance. Furthermore, dispersed computing resources and high network
latency may result in high training costs. Second, in the practical scenario, concur-
rently training multiple models in the shared distributed network creates competition
for computing and communication resources. As shown in Fig. 5.1, two FL models are
trained concurrently and each FL model requires one PS and three workers for model
training. In this case, which FL model is preferentially served at which server directly
affects the total training cost of all FL models. To this end, appropriate participant selection and learning scheduling decisions are crucial for the multi-model FL training scenario.

Figure 5.1: The training process of distributed federated learning.
As a result, we concentrate primarily on the problem of joint participant selection
and learning scheduling in multi-model FL training scenarios. It should be noted
that in distributed networks, any server can serve as both a PS and a client, and
that participant selection includes selecting both the PS and clients for each FL
model. For clarity, we refer to a client as an FL worker. It is worth noting that both
participant (client) selection and learning scheduling problems have been studied in
FL using classical computers recently [76, 35, 108]. However, most existing works
focus on optimizing a single global FL model rather than multiple FL models. More
importantly, none of these works take into account the PS selection for multiple FL
models. Recently, Wei et al. [117] considered a joint participant selection and learning
scheduling problem in multi-model federated edge learning, and proposed multi-stage
methods to solve the joint optimization problem. However, due to the nature of
the formulated optimization as a mixed-integer non-linear program (MINLP), the
proposed methods may not lead to optimal solutions and do not scale well when the
problem grows more complex.
To address the aforementioned issue, quantum computing has recently emerged
as a powerful optimization tool [77, 89, 5]. Such approaches, however, may not be

competitive until the shortcomings of QC, such as the limited number of qubits,
are overcome by further technological advancements. To that end, several hybrid
quantum-classical solutions [102, 3] have been proposed to tackle optimization prob-
lems by leveraging the complementary strengths of quantum and classical comput-
ers. Inspired by these pioneering efforts, we attempt to solve our joint participant selection and learning scheduling problem with a hybrid quantum-classical optimization approach combined with decomposition techniques. Such an approach enables us to fully utilize the capabilities of both quantum and classical computers. In addition, D-Wave stands out in the quantum computing market because it offers the quantum annealer with the most qubits among current candidates. With D-Wave's quantum annealer, one can solve an integer linear programming (ILP) problem by converting it into a quadratic unconstrained binary optimization (QUBO) model, which is inspired by the Ising model. As a result, we develop novel hybrid quantum-classical algorithms on D-Wave's quantum computer.
Three research challenges exist in developing efficient hybrid quantum-classical techniques with decomposition schemes. First, how do we convert our original MINLP problem into an ILP problem and further convert it into a QUBO model as an input to D-Wave's quantum computer? Second, how do we design a novel hybrid quantum-classical strategy that solves the corresponding problem in fewer iterations? Third, how do we derive an efficient number of integer cuts that iteratively reduce the search space and accelerate the convergence of the hybrid quantum-classical methods? To handle these challenges, we develop two novel hybrid quantum-classical algorithms to demonstrate the potential of such hybrid approaches.

5.2 System Model and Problem Formulation


In this section, we introduce the system model and the optimization problem.

5.2.1 System Model


The distributed network connecting all computing servers is modeled as a graph
G(V, E), where V = {v1 , · · · , vN } and E = {e1 , · · · , eL } are the sets of N servers and
L direct connection links, respectively. Generally, each server v_i owns a specific storage capacity sc_i and CPU frequency sf_i while each link e_j has an available bandwidth
bj . Each server holds a distinct set of datasets and can be used for local model train-
ing. We assume that each server can hold multiple types of datasets for FL training
and the dataset used by the j-th FL model in the i-th server is denoted by Di,j . In
this paper, we focus on the participant selection based on computing/communication
resources in the distributed network and do not consider training data distributions
(which is another important research topic and orthogonal to our research).

5.2.2 Federated Learning Model


We assume that parallel FL is conducted, where multiple models are being trained concurrently in the network. We consider a classical FL process that consists of a PS and multiple workers. Instead of using a single centralized server as
the PS of all models, we select a group of servers distributed in the network with
enough capacity as its participants to jointly train the FL model. We assume that W FL models (M = {m_1, · · · , m_W}) are trained concurrently and each FL model requests certain requirements for the training task, i.e.,

1. κj + 1 servers as participants including one PS and κj workers, whose CPU and


storage capacity should be larger than its required minimal CPU frequency χj
and model size µj respectively;

2. the achieved global convergence rate needs to be larger than ςj .

We further assume that each server can only play a role as either the PS or the worker
for any FL model at one time.
The training process of each FL model includes three stages: (a) initializing and
broadcasting the global model of mj to each participant; (b) each worker performs the
local model computation using its own dataset; and (c) aggregating the local models
from workers, as illustrated in Fig. 4.2 and detailed in the sequel.
Stage 1: Global Model Initialization. In Stage 1, we initialize the global model
parameter for each FL model as ωj and send the global model parameter to each
selected participant.
Stage 2: Local Model Computation. Let the local model parameters of model
mj on the server vi be ωi,j and the loss function on a training data sample s be

fi,j (ωi,j , dxs , dys ), where dxs is the input feature and dys is the required label. Then
the loss function on the whole local dataset of vi is defined as
$$F_{i,j}(\omega_{i,j}) = \frac{1}{|D_{i,j}|} \sum_{s \in D_{i,j}} f_{i,j}(\omega_{i,j}, d_s^x, d_s^y). \qquad (5.1)$$

Generally, FL will perform round by round and we denote the total number of global
aggregation, and local updates as α̂ and β̂, where α and β are their indexes, respec-
tively. In the α-th round, each worker runs a number of local updates to achieve a
local convergence accuracy ϱj ∈ (0, 1). At the β-th local iteration, each worker follows
the same local update rule as

$$\omega_{i,j}^{\alpha,\beta} = \omega_{i,j}^{\alpha,\beta-1} - \eta \nabla F_{i,j}(\omega_{i,j}^{\alpha,\beta-1}), \qquad (5.2)$$

where η is the learning rate of the loss function. This process will run until

$$F_{i,j}(\omega_{i,j}^{\alpha,\hat{\beta}}) - F_{i,j}^{*} \le \varrho_j \left[ F_{i,j}(\omega_{i,j}^{\alpha,0}) - F_{i,j}^{*} \right]. \qquad (5.3)$$

Here, we set $\omega_{i,j}^{\alpha,0} = \omega_j$.
Stage 3: Global Aggregation. At this stage, one participant has to be chosen as the PS. After $\hat{\beta}$ local updates, all workers send their local model parameters $\omega_{i,j}^{\alpha,\hat{\beta}}$ to the PS. The PS performs FedAvg to aggregate the global model parameters as

$$\omega_j^{\alpha} = \sum_{i \in S_j} \frac{D_{i,j}}{D_j} \omega_{i,j}^{\alpha-1,\hat{\beta}}, \qquad (5.4)$$

where $D_j = \bigcup_{i \in S_j} D_{i,j}$ is the total number of data samples from the κ_j workers and S_j is the set of selected workers. The global convergence of the global model is defined as

$$G_j(\omega_j^{\hat{\alpha}}) - G_j^{*} \le \varsigma_j \left[ G_j(\omega_j^{0}) - G_j^{*} \right], \qquad (5.5)$$

where $G_j^{*}$ is the global optimum of FL model m_j.
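A minimal sketch of the FedAvg aggregation step in (5.4) is given below, using plain numpy vectors in place of real model parameters; the data-size weights correspond to D_{i,j}/D_j. The actual training code uses PyTorch models, so this is illustrative only.

```python
# A minimal numpy sketch of FedAvg as in (5.4): the PS weighs each worker's
# parameters by its share of the total data.
import numpy as np

def fedavg(local_params, data_sizes):
    """local_params: 1-D parameter vectors from the kappa_j workers;
    data_sizes: |D_{i,j}| for each worker."""
    total = float(sum(data_sizes))
    return sum((n / total) * w for w, n in zip(local_params, data_sizes))

workers = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 0.0])]
print(fedavg(workers, data_sizes=[100, 300, 600]))   # data-weighted average
```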


Finally, from (5.3) and (5.5), in order to achieve the desired local convergence rate
ϱj and global convergence rate ςj , we need to calculate the number of local updates
β̂ = φj and the number of global iterations α̂ = ϑj .
From the above observation, we can find that the global convergence rate ς_j for each FL model can be predefined according to the requirement of the model training requester, and we have to conduct enough local updates and global iterations to achieve

that. Then we have the following relationship between the convergence rate and the local updates as well as global iterations [35, 36, 14, 131, 60, 92]:

$$\vartheta_j \ge \frac{2\lambda^2}{\gamma^2 \xi} \ln\left(\frac{1}{\varsigma_j}\right)\frac{1}{1-\varrho_j} \triangleq \vartheta_0 \ln\left(\frac{1}{\varsigma_j}\right)\frac{1}{1-\varrho_j}, \qquad (5.6)$$

$$\varphi_j \ge \frac{2}{(2-\lambda\delta)\delta\gamma} \log_2\left(\frac{1}{\varrho_j}\right) \triangleq \varphi_0 \log_2\left(\frac{1}{\varrho_j}\right), \qquad (5.7)$$

where ξ and δ are two variables in the ranges $(0, \frac{\gamma}{\lambda}]$ and $(0, \frac{2}{L})$, respectively, λ is the λ-Lipschitz parameter, and γ is the γ-strongly convex parameter. Both λ and γ are determined by the loss function. $\vartheta_0$ and $\varphi_0$ are two constants, where $\vartheta_0 = \frac{2\lambda^2}{\gamma^2\xi}$ and $\varphi_0 = \frac{2}{(2-\lambda\delta)\delta\gamma}$.
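For concreteness, the small helper below evaluates the right-hand sides of (5.6) and (5.7) to obtain the implied numbers of global iterations and local updates. The constants ϑ_0 = 15 and φ_0 = 4 follow the values used later in our experiments; the inputs are examples.

```python
# Numbers of global iterations and local updates implied by (5.6) and (5.7).
import math

def fl_rounds(varsigma_j, varrho_j, theta0=15.0, phi0=4.0):
    global_iters = theta0 * math.log(1.0 / varsigma_j) / (1.0 - varrho_j)
    local_updates = phi0 * math.log2(1.0 / varrho_j)
    return math.ceil(global_iters), math.ceil(local_updates)

print(fl_rounds(varsigma_j=0.001, varrho_j=0.5))   # e.g., (208, 4)
```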

5.2.3 Cost Model


Our cost model consists of four parts: transmission cost, local update cost, global aggregation cost, and participant cost, defined as follows.

Transmission Cost: The transmission cost mainly consists of the FL model downloading and uploading costs. Denote by µ_j the uploaded and downloaded model size of FL model m_j. We leverage the shortest path in the distributed network to calculate the transmission cost when downloading models from the PS or uploading models to the PS. Let ρ_j(v_i, v_k) be the transmission cost of model m_j from server v_i to v_k; it can be calculated by $\rho_j(v_i, v_k) = \sum_{e_l \in P_{i,k}} \frac{\mu_j}{b_l}$, where P_{i,k} is the shortest path connecting v_i to v_k. For model m_j, the total transmission cost is $C_j^{trans} = 2\vartheta_j \sum_{k=1}^{N}\sum_{i=1}^{N} x_{k,j} \cdot y_{i,j} \cdot \rho_j(v_i, v_k)$. Here, v_i and v_k are a worker and the PS of m_j, respectively. In addition, x_{i,j} and y_{i,j} are the decision variables on whether to select server v_i as a parameter server or an FL worker for the j-th FL model.

Local Update Cost: Let ψ(·) be the function defining the CPU cycles needed to process the sample data D_{j,i} used by the j-th FL model and stored in server v_i. The total local update cost for the j-th FL model is $C_j^{local} = \vartheta_j \cdot \varphi_j \cdot \sum_{i=1}^{N} y_{i,j} \cdot \frac{\psi(D_{j,i})}{sf_i}$.

Global Aggregation Cost: Similarly, the global aggregation cost for the uploaded FL models is $C_j^{global} = \vartheta_j \cdot \sum_{i=1}^{N} x_{i,j} \cdot \frac{\psi(\mu_j)}{sf_i}$.

Participant Cost: Each participant of FL model m_j will be paid a basic rental cost for utility management, which is related to its CPU frequency. Let p_j be the unit price for a CPU unit; accordingly, the participant cost for the j-th FL model is $C_j^{rent} = \sum_{i=1}^{N} (x_{i,j} + y_{i,j}) \cdot p_j \cdot sf_i$.
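Putting the four terms together, the following sketch evaluates the total learning cost of a candidate selection. All names and inputs are toy stand-ins: rho[i][k] is assumed to be the per-unit (per-MB) shortest-path cost, so it is scaled by the model size µ_j here, and psi_* play the role of the CPU-cycle function ψ(·).

```python
# A sketch of evaluating the four cost terms for a candidate selection.
def total_cost(x, y, rho, psi_data, psi_model, sf, p, mu, theta, phi):
    """x[i][j], y[i][j]: PS/worker selections; theta[j], phi[j]: global/local
    round counts obtained from (5.6) and (5.7)."""
    N, W = len(x), len(x[0])
    cost = 0.0
    for j in range(W):
        trans = 2 * theta[j] * sum(x[k][j] * y[i][j] * rho[i][k] * mu[j]
                                   for k in range(N) for i in range(N))
        local = theta[j] * phi[j] * sum(y[i][j] * psi_data[i][j] / sf[i]
                                        for i in range(N))
        glob = theta[j] * sum(x[i][j] * psi_model[j] / sf[i] for i in range(N))
        rent = sum((x[i][j] + y[i][j]) * p[j] * sf[i] for i in range(N))
        cost += trans + local + glob + rent
    return cost

# Two servers, one model: server 0 is the PS, server 1 the single worker.
x, y = [[1], [0]], [[0], [1]]
print(total_cost(x, y, rho=[[0, 1], [1, 0]], psi_data=[[5], [5]],
                 psi_model=[2], sf=[2, 3], p=[0.1], mu=[10],
                 theta=[4], phi=[3]))
```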

5.2.4 Problem Formulation
Under the previously introduced multi-model federated learning scenario, we consider how to choose participants for each of the models and how to schedule their local/global updates. Recall that we assume only one PS and κ_j workers are selected for each model, i.e., $\sum_{i=1}^{N} x_{i,j} = 1$ and $\sum_{i=1}^{N} y_{i,j} = \kappa_j$. We use ϱ_j ∈ [0.01, 0.99] to represent the maximal local convergence rate of m_j. We will use ϱ_j and ς_j to control the number of global iterations and local updates for model m_j. Note that ς_j is given by the model m_j as a requirement. Overall, x_{i,j}, y_{i,j}, and ϱ_j are the decision variables of our optimization. We now formulate our participant selection and learning scheduling problem for FL in the distributed network, where we need to select the parameter server and workers for each model as well as achieve the desired local convergence rate. The objective of our problem is to minimize the total learning cost of all FL models as follows:

$$\min_{x,y,\varrho} \sum_{j=1}^{W} \left( C_j^{trans} + C_j^{local} + C_j^{global} + C_j^{rent} \right) \qquad (5.8)$$

s.t.
$x_{i,j}\mu_j\kappa_j \le sc_i, \quad x_{i,j}\chi_j \le sf_i, \quad \forall i, j,$  (5.8a)
$y_{i,j}\mu_j \le sc_i, \quad y_{i,j}\chi_j \le sf_i, \quad \forall i, j,$  (5.8b)
$\sum_{i=1}^{N} x_{i,j} = 1, \quad \sum_{i=1}^{N} y_{i,j} = \kappa_j, \quad \forall j,$  (5.8c)
$\sum_{j=1}^{W} (x_{i,j} + y_{i,j}) \le 1, \quad \forall i,$  (5.8d)
$i \in (1, \dots, N), \quad j \in (1, \dots, W),$  (5.8e)
$x_{i,j} \in \{0, 1\}, \quad y_{i,j} \in \{0, 1\},$  (5.8f)
$\varrho_j \in [0.01, 0.99].$  (5.8g)

Constraints (5.8a) and (5.8b) make sure that the storage and CPU satisfy the requirements of the FL model. Constraint (5.8c) guarantees that the numbers of PS and FL workers of each model are 1 and κ_j, respectively. Constraint (5.8d) ensures that each server trains only one FL model and can only play one role at a time. The decision variables and their ranges are given in (5.8e)-(5.8g). Note that the formulated problem (5.8) is a non-linear mixed-integer program, which is NP-hard in general and challenging to solve with classical computing.

Figure 5.2: The proposed HQCBD framework.

5.3 Hybrid Quantum Assisted Benders' Decomposition (HQCBD) Methods
Motivated by the advances in QC, we decouple the original problem into a mas-
ter problem and a subproblem by leveraging Benders’ Decomposition [142, 21] and
solving them using quantum and classical methods, respectively. Fig. 5.2 shows the
framework of our proposed HQCBD.

5.3.1 Benders’ Decomposition


We first briefly introduce the basic idea of Benders’ Decomposition. Benders’
Decomposition is a useful algorithm for solving convex optimization problems with a
large number of variables. It works best when a large problem can be decomposed into
two (or more) smaller problems that are individually much easier to solve [142]. At
a high level, the procedure will iteratively solve the master problem and subproblem.
Each iteration provides an updated upper and lower bound on the optimal objective
value. The result of the subproblem either provides a new constraint to add to the
master problem or a certificate that no finite optimal solution exists for the problem.
The procedure terminates when it is shown that no finite optimal solution exists or
when the gap between the upper and lower bound is sufficiently small [85].

We reformulate our original problem (5.8) by extracting all constant variables and
further introduce additional continuous variables uj and wj to replace ϱj as below.
$$\min_{x,y,u,w} \sum_{j=1}^{W}\left[\sum_{k=1}^{N}\sum_{i=1}^{N} u_j \cdot a_{1,i,j,k} \cdot x_{k,j} \cdot y_{i,j} + \sum_{i=1}^{N} w_j \cdot a_{2,i,j} \cdot y_{i,j} + \sum_{i=1}^{N} u_j \cdot a_{3,i,j} \cdot x_{i,j} + \sum_{i=1}^{N} a_{4,i} \cdot (x_{i,j} + y_{i,j})\right] \qquad (5.9)$$

s.t. (5.8a) − (5.8g),
$b_1 \le u_j \le b_2$,  (5.9a)
$b_3 \le w_j \le b_4$,  (5.9b)

where the four sets of constant variables are $a_{1,i,j,k} = 2\vartheta_0 \ln(\frac{1}{\varsigma_j}) \cdot \rho_j(v_i, v_k)$, $a_{2,i,j} = \varphi_0 \vartheta_0 \ln(\frac{1}{\varsigma_j}) \cdot \frac{\psi(D_{j,i})}{f_i}$, $a_{3,i,j} = \vartheta_0 \ln(\frac{1}{\varsigma_j}) \cdot \frac{\psi(\mu_j)}{f_i}$, and $a_{4,i} = \delta f_i$. Also, $u_j = \frac{1}{1-\varrho_j}$, $w_j = u_j \log_2(\frac{u_j}{u_j - 1})$, $b_1 = 1.01$, $b_2 = 100$, $b_3 = 1.435$, and $b_4 = 6.725$. Note that Problem (5.9) consists of several terms that are products of integer and continuous variables, e.g., $u_j \cdot x_{k,j} \cdot y_{i,j}$ and $w_j \cdot y_{i,j}$. Hence, we further introduce variables $o_{k,i,j}$, $p_{i,j}$, and $q_{i,j}$ to represent the product of an integer variable and a continuous variable as below.

$$\min_{x,y,u,w,o,p,q} \sum_{j=1}^{W}\left[\sum_{k=1}^{N}\sum_{i=1}^{N} a_{1,i,j,k} \cdot o_{k,i,j} + \sum_{i=1}^{N} a_{2,i,j} \cdot p_{i,j} + \sum_{i=1}^{N} a_{3,i,j} \cdot q_{i,j} + \sum_{i=1}^{N} a_{4,i} \cdot (x_{i,j} + y_{i,j})\right] \qquad (5.10)$$

s.t. (5.8a) − (5.8g), (5.9a), (5.9b),
$b_1 x_{k,j} y_{i,j} \le o_{k,i,j} \le b_2 x_{k,j} y_{i,j}$,  (5.10a)
$u_j - o_{k,i,j} \le b_2 (1 - x_{k,j} y_{i,j})$,  (5.10b)
$u_j - o_{k,i,j} \ge b_1 (1 - x_{k,j} y_{i,j})$,  (5.10c)
$b_3 y_{i,j} \le p_{i,j} \le b_4 y_{i,j}$,  (5.10d)
$w_j - p_{i,j} \le b_4 (1 - y_{i,j})$,  (5.10e)
$w_j - p_{i,j} \ge b_3 (1 - y_{i,j})$,  (5.10f)
$b_1 x_{i,j} \le q_{i,j} \le b_2 x_{i,j}$,  (5.10g)
$u_j - q_{i,j} \le b_4 (1 - x_{i,j})$,  (5.10h)
$u_j - q_{i,j} \ge b_3 (1 - x_{i,j})$.  (5.10i)

So far, we have linearized the product of binary and continuous variables as
(u, w, o, p, q), and therefore we can apply Benders’ Decomposition. In problem (5.10),
for each possible choice x̄ and ȳ, we find the best choices for u, w, o, p, q by solving
a linear program. So we regard u, w, o, p, q as a function of x, y. Then we replace
the contribution of u, w, o, p, q to the objective with a scalar variable representing the
value of the best choice for a given x̄ and ȳ. We start with a crude approximation
to the contribution of u, w, o, p, q and then generate a sequence of dual solutions to
tighten up the approximation. In addition, problem (5.10) can be rewritten in the following general form:

$$\min_{X,Y} c^\top X + h^\top Y \qquad (5.11)$$

s.t.
$A_1 X = a_1$,  (5.11a)
$A_2 X \le a_2$,  (5.11b)
$BX + GY \le a_3$,  (5.11c)
$X = [x, y]^\top, \ X \in \mathbb{X}$,  (5.11d)
$Y = [u, w, o, p, q]^\top, \ Y \in \mathbb{Y}$.  (5.11e)

where c and h are the coefficient vectors for the binary and continuous variables in the objective function, respectively; A_1, A_2, B, and G are coefficient matrices in the constraints, while a_1, a_2, and a_3 are constant vectors.
Next, we detail the formulations of the corresponding subproblem (an LP problem) and master problem (an integer programming (IP) problem) after the Benders' Decomposition.

5.3.2 Classical Optimization for Subproblem


Based on the decomposition in Section 5.3.1, the subproblem is defined as follows.
$$\min_{u,w,o,p,q} \sum_{i=1}^{N}\sum_{j=1}^{W}\left(\sum_{k=1}^{N} a_{1,i,j,k} \cdot o_{k,i,j} + a_{2,i,j} \cdot p_{i,j} + a_{3,i,j} \cdot q_{i,j}\right) \qquad (5.12)$$

s.t. (5.9a), (5.9b), (5.10a) − (5.10i).

The general form of the subproblem can be further represented as follows.

Subproblem:  $\min_{Y} h^\top Y$  (5.13)
s.t. $-GY \ge BX - a_3$,  (5.13a)
$Y = [u, w, o, p, q]^\top, \ Y \in \mathbb{Y}$.  (5.13b)

In addition, the dual problem of the subproblem is defined below, where π is the dual variable:

$$\max_{\pi} (BX - a_3)^\top \pi \qquad (5.14)$$
s.t. $-G^\top \pi \le h$,  (5.14a)
$\pi \ge 0$.  (5.14b)

This problem can be directly solved by a classical LP solver on a classical CPU computer, such as SciPy [105] or Gurobi [27].

5.3.3 Quantum Formulation for Master Problem


Based on the dual problem of the subproblem, the master problem in a general
form can be defined below.

Master:  $\min_{X} c^\top X + \lambda$  (5.15)
s.t. $A_1 X = a_1$,  (5.15a)
$A_2 X \le a_2$,  (5.15b)
$\lambda \ge \lambda_{down}$,  (5.15c)
$\lambda \ge (BX - a_3)^\top \pi^k, \ \forall k \in \hat{K}$,  (5.15d)
$X = [x, y]^\top, \ X \in \mathbb{X}$.  (5.15e)

where λ is the optimal value of the subproblem at the current iteration. Constraint (5.15c) is the feasible lower bound of the subproblem and (5.15d) is the corresponding Benders' cut, where K̂ is the stored index set of feasibility cuts from previous iterations.
QUBO Formulation. Quantum annealers are able to solve the optimization
problem in a QUBO formulation. To leverage the state-of-the-art quantum annealers provided by D-Wave, the master problem has to be converted to the corresponding
QUBO formulation. Due to the rule of QUBO setup, we have to reformulate our
constrained master problem as the unconstrained QUBO by using penalties. The
basic idea is to find the best penalty coefficients of the constraints. Following the
principle of constraint-penalty pairs in [23], the constraints are converted as follows.

(5.15a) ⇒ $\xi_1: P^1 (A_1 X - a_1)^2$,

(5.15b) ⇒ $\xi_2: P^2 \left(A_2 X - a_2 + \sum_{l=0}^{\bar{l}^2} 2^l s_l^2\right)^2$, where $\bar{l}^2 = \lceil \log_2(a_2 - A_2 X) \rceil$,

(5.15c) ⇒ $\xi_3: P^3 \left(\lambda_{down} - \lambda + \sum_{l=0}^{\bar{l}^3} 2^l s_l^3\right)^2$, where $\bar{l}^3 = \lceil \log_2(\lambda - \lambda_{down}) \rceil$,

(5.15d) ⇒ $\xi_4: P^4 \left((BX - a_3)^\top \pi - \lambda + \sum_{l=0}^{\bar{l}^4} 2^l s_l^4\right)^2$, where $\bar{l}^4 = \lceil \log_2[\lambda - \min_{X,\pi}(BX - a_3)^\top \pi^l] \rceil$.
X,π

Here, P ∗ is the predefined penalty vector when the corresponding constraint is vio-
lated. s∗ is a binary slack variable and ¯l∗ is the upper bound of the number of slack
l

variables. Then, the reformulated unconstrained master problem is defined as

max c⊺ X + λ + ξ1 + ξ2 + ξ3 + ξ4 . (5.16)
X

Variable Representation. Now consider the problem (5.16), it is still not the
QUBO formation due to the existence of the continuous variable λ. Thus, we need to
represent the continuous variable λ using binary bits. We use a binary vector w with
the length of M bits to replace continuous variable λ and denote it as a new discrete
number λ̂ ∈ Q. In general, λ̂ requires the binary numeric system assigning M bits to
replace continuous variable λ. Then we can recover the λ̂ by
$$\lambda = \sum_{ii=-m}^{\bar{m}^+} 2^{ii}\, w_{ii+m} \;-\; \sum_{jj=0}^{\bar{m}^-} 2^{jj}\, w_{jj+1+m+\bar{m}^+} = \hat{\lambda}(w). \qquad (5.17)$$

In (5.17), $\bar{m}^+ + 1$ is the number of bits for the positive integer part $\mathbb{Z}^+$, m is the number of bits for the positive decimal part, and $\bar{m}^- + 1$ is the number of bits for the negative integer part $\mathbb{Z}^-$.

Figure 5.3: Flow of HQCBD with a single cut and multiple cuts.

Then, the final QUBO formulation of the master problem is defined as follows:

$$\min_{X,w} c^\top X + \hat{\lambda}(w) + \xi_1 + \xi_2 + \xi_3 + \xi_4. \qquad (5.18)$$
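The binary encoding in (5.17) can be checked with the small helper below; the bit counts (m = 2 fractional bits, m̄⁺ = 3, m̄⁻ = 1) are illustrative choices, not the values used in our solver.

```python
# A sketch of the binary encoding in (5.17): a continuous lambda is replaced
# by a weighted sum of binary bits (fractional, positive-integer, and
# negative-integer parts). Bit counts are illustrative.
def lambda_from_bits(w, m=2, m_plus=3, m_minus=1):
    """w: binary bits of length (m + m_plus + 1) + (m_minus + 1)."""
    pos_bits = m + m_plus + 1
    lam = sum(2.0 ** ii * w[ii + m] for ii in range(-m, m_plus + 1))
    lam -= sum(2.0 ** jj * w[pos_bits + jj] for jj in range(m_minus + 1))
    return lam

# 0.25 + 2 + 4 - 1 = 5.25 with the bit layout above
print(lambda_from_bits([1, 0, 0, 1, 1, 0, 1, 0]))
```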

5.3.4 HQCBD Algorithm


Our proposed HQCBD is described in Algorithm 7. Fig. 5.2 shows the overall flow of HQCBD, while Fig. 5.3(a) shows the detailed interaction between the master problem and the subproblem. The master problem is solved by a quantum computer and generates a binary solution (X′), which is then sent to general devices for distributed computation of the subproblems by a classical solver (e.g., SciPy). After the subproblems are solved, an optimality or feasibility cut is sent back to the master problem, and the process continues to the next round. Specifically, as shown in Algorithm 7, we first initialize the upper and lower bounds of the problem as well as other parameters, e.g., the convergence threshold ϵ and the maximal number of iterations max_itr (Lines 1-2). Then appropriate penalty numbers or arrays are generated (Line 4). After that, we reformulate the master problem (5.10) in the QUBO format, solve the QUBO problem with a quantum computer, and update the lower bound λ̲ (Lines 5-7). Given X′ from the master problem, we solve the subproblem (5.14) and update the upper bound λ̄ (Lines 8-9). We finally add the Benders' cut to the master problem and continue to the next iteration (Lines 10-11) until convergence (Line 3).

Algorithm 7 Hybrid Quantum-Classical Benders’ Decomposition (HQCBD)
Input: Distributed network with N servers V , W FL models M , Coefficient of the
objective function and constraints in master problem and subproblem
Output: All decision variables X and Y
1: Initialize the upper/lower bounds of λ: λ̄ = +∞, λ̲ = −∞
2: Initialize threshold ϵ = 0.001, max_itr = 100, itr = 1
3: while |λ̄ − λ̲| > ϵ and itr < max_itr do
4:    P ← Appropriate penalty numbers or arrays
5:    Q ← Reformulate both objective and constraints in (5.10) and construct the QUBO formulation as (5.18)
6:    X′ ← Solve problem (5.18) by quantum computer
7:    λ̲ ← Extract w and replace λ with λ̂(w) as (5.17)
8:    SUP(X) ← Solve problem (5.14) with fixed X′
9:    λ̄ ← SUP(X)
10:   Add a Benders' cut to the master problem as (5.15d)
11:   itr += 1
12: end while
13: return X, Y

We leverage the D-Wave solver to implement our proposed algorithm to solve the
QUBO master problem. In addition, the penalties also need to be carefully tuned for
a decent QUBO model. In general, a large penalty can cause the quantum annealer
to malfunction due to coefficient explosion. In contrast, a small penalty can make the
quantum annealer ignore the constraints. A well-tuned penalty will lead to a fairly
high probability of the quantum solver giving the correct answer.
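The skeleton below mirrors one HQCBD iteration under stated assumptions: dimod's exact reference sampler stands in for the D-Wave QPU on a toy QUBO, and SciPy's linprog solves the dual subproblem (5.14). The matrices and QUBO are placeholders, not our actual model.

```python
# One illustrative HQCBD iteration: quantum-style master (stand-in sampler)
# plus a classical LP dual subproblem. All data below is toy.
import dimod
import numpy as np
from scipy.optimize import linprog

def solve_master(Q):
    """Return the lowest-energy binary assignment of the QUBO master."""
    best = dimod.ExactSolver().sample_qubo(Q).first
    return np.array([best.sample[k] for k in sorted(best.sample)]), best.energy

def solve_sub_dual(B, a3, G, h, X):
    """Solve max (BX - a3)^T pi s.t. -G^T pi <= h, pi >= 0 (as a min-LP)."""
    c = -(B @ X - a3)                       # linprog minimizes
    return linprog(c, A_ub=-G.T, b_ub=h, bounds=(0, None))

Q = {(0, 0): -1.0, (1, 1): -1.0, (0, 1): 2.0}        # toy master QUBO
X, lower = solve_master(Q)                           # "quantum" step
B, a3 = np.eye(2), np.zeros(2)
G, h = np.eye(2), np.ones(2)
res = solve_sub_dual(B, a3, G, h, X)                 # classical step
upper = -res.fun                                     # dual objective value
# (BX - a3)^T pi^k <= lambda would now be added to the master as a new cut.
print(X, lower, upper)
```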

5.3.5 Multiple Cuts Version


In Algorithm 7 (Line 10), we only consider a single Benders' cut imported into the master problem in each round. This cut is computed from the subproblem based on an optimal feasible solution (X′) returned by the quantum computer (Line 6 of Algorithm 7). However, one of the advantages of the quantum algorithm is that it can generate multiple feasible solutions simultaneously. Therefore, to accelerate the convergence of the master problem, we further introduce a hybrid quantum-classical multiple-cuts optimization method.
Algorithm 8 Multiple-cuts Benders’ Decomposition (MBD)
Input: Distributed network with N servers V , W FL models M , Coefficient of the
objective function and constraints in master problem and subproblem, number of
cuts σ
Output: All decision variables X and Y
1: Initialize the upper/lower bounds of λ: λ̄ = +∞, λ̲ = −∞
2: Initialize threshold ϵ = 0.001, max_itr = 100, itr = 1
3: while |λ̄ − λ̲| > ϵ and itr < max_itr do
4:    P ← Appropriate penalty numbers or arrays
5:    Q ← Reformulate both objective and constraints in (5.10) and construct the QUBO formulation as (5.18)
6:    {X′}σ ← Solve problem (5.18) by quantum computer and return σ feasible solutions
7:    λ̲ ← Extract the w with the highest value and replace λ with λ̂(w) as (5.17)
8:    {SUP(X)}σ ← Solve σ subproblems (5.14) with fixed X′ in parallel
9:    λ̄ ← the lowest value among {SUP(X)}σ
10:   Add all σ Benders' cuts to the master problem as (5.15d)
11:   itr += 1
12: end while
13: return X, Y

In the multiple-cuts version of the HQCBD algorithm, we leverage the multiple feasible solutions generated by the quantum computer and select the top σ feasible solutions to generate multiple cuts. These cuts are then inserted into the master problem in each iteration. Fig. 5.3(b) illustrates this idea.
The detailed algorithm (MBD) is given by Algorithm 8. Compared with the single-cut version of the HQCBD algorithm, first, the top σ feasible solutions are sent to σ subproblems, and all subproblems execute in parallel (Lines 6 and 8). Second, each subproblem generates a Benders' cut and sends it back to the master problem (Line 9). Finally, the master problem collects all Benders' cuts, adds them to the constraints (Line 10), and continues to the next iteration. Note that if one of these subproblems

reaches the threshold, the iteration will be stopped since the upper bound and lower
bound converge to the predefined threshold.
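A sketch of harvesting the top-σ feasible solutions that seed the σ parallel subproblems is shown below. dimod's simulated-annealing reference sampler is a stand-in for the QPU (a D-Wave sampler exposes the same sampleset interface), and the QUBO and num_reads are placeholders.

```python
# Harvesting the top-sigma feasible solutions for MBD (Algorithm 8).
import dimod

def top_sigma_solutions(Q, sigma=3, num_reads=100):
    sampler = dimod.SimulatedAnnealingSampler()      # stand-in for the QPU
    sampleset = sampler.sample_qubo(Q, num_reads=num_reads)
    unique = sampleset.aggregate()                   # merge duplicate reads
    # Lowest-energy assignments first; each one seeds its own subproblem/cut.
    return [dict(s.sample) for s in unique.data(sorted_by='energy')][:sigma]

Q = {(0, 0): -1.0, (1, 1): -0.5, (0, 1): 1.5}
for sol in top_sigma_solutions(Q):
    print(sol)
```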

5.4 Evaluation
In this section, we simulate a distributed network environment and conduct experiments with realistic FL tasks using publicly available datasets. To validate the feasibility of our hybrid quantum-classical optimization algorithm, we run the proposed algorithms on a hybrid D-Wave quantum processing unit (QPU). We accessed the D-Wave system provided by the Leap quantum cloud service [98]. Based on the Pegasus topology, the D-Wave system has over 5k qubits and 35k couplers and can solve complex problems of up to 1M variables and 100k constraints. We only performed test cases that can be resolved in under 100 iterations, due to the high cost of QPU utilization and the developer's time constraints.

5.4.1 Simulation Setup


Network Setting: Our distributed computing environment consists of 100 servers, with the topology based on the real-world EUA-Dataset [41] and the Internet Topology Zoo [38]. The EUA-Dataset is widely used in mobile computing and contains the geographical locations of 125 cellular base stations in the Melbourne central business district, while the Internet Topology Zoo is a popular network topology dataset that includes a number of historical network maps from all over the world. We randomly select a set of servers from these topology datasets to conduct simulations. In each simulation, each server has a maximal storage capacity sc_i, CPU frequency sf_i, and link bandwidth b_j in the ranges of 1,024 ∼ 2,048 GB, 2 ∼ 5 GHz, and 512 ∼ 1,024 Mbps, respectively.
Datasets and FL models: We conduct extensive experiments on the following real-world datasets: the California Housing dataset [81], MNIST [43], Fashion-MNIST (FMNIST) [121], and CIFAR-10 [39]. These are well-known ML datasets for linear regression, logistic regression, or image classification tasks. Two models with convex loss functions are implemented on the above real-world datasets for performance
evaluation: (i) Linear Regression with MSE loss on the California Housing dataset and (ii) Logistic Regression with cross-entropy loss on MNIST. We are
also interested in the performance of our proposed methods on FL models with non-
convex loss functions. Thus, three datasets, MNIST, FMNIST, and CIFAR-10, are
used to train convolutional neural network (CNN) models with different structures.
Benchmarks and Metrics: We compare our proposed HQCBD and MBD algorithms with three baseline strategies: classical Benders' decomposition (CBD), a random algorithm (RAND), and a two-stage iterative optimization algorithm (TWSO) [117]. CBD uses a classical solver (Gurobi [27] or SciPy [105]) to solve the master problem and subproblems. RAND randomly generates decisions on the model's parameter server, FL workers, and local convergence rate under certain constraints. TWSO is a previous algorithm [117] that decomposes the original problem into two subproblems (participant selection and learning scheduling) and solves them iteratively. The following metrics are adopted to compare the performance of our proposed methods and the baselines: the total cost of FL training, the loss or accuracy of the FL models, the number of iterations, the solver accessing time, and the gain or advancement of our proposed algorithms over CBD.

5.4.2 Simulation Results


Performance of HQCBD

To demonstrate the feasibility and performance of our proposed HQCBD, we conduct three sets of small-scale experiments with different case settings (servers are selected from the 100 servers). As shown in Table 5.1, there are three cases. The first case includes 7 servers, 1 FL model, and 3 workers per model, with a total of 63 binary variables. The second case has 7 servers, 2 FL models, and 2 workers per model, with a total of 126 binary variables. The third case consists of 9 servers, 2 FL models, and 3 workers per model, with a total of 198 binary variables. For each case, we run both CBD and HQCBD. Fig. 5.4 and Table 5.1 show the related results.
In Figs. 5.4(a)-(c), the blue dashed line denotes the upper bound of λ used in HQCBD, and the orange dashed line denotes the lower bound of λ in HQCBD. As we can see, the upper bound and lower bound finally converge, and we obtain the non-negative lower bound at the 31st, 45th, and 89th rounds for the three cases, respectively.
[Figure: four panels plotting the upper and lower bounds of λ versus rounds for Cases 1–3, and the master problem value of HQCBD versus CBD.]
Figure 5.4: Performance of HQCBD: its convergence. (a) Case 1; (b) Case 2; (c) Case 3; (d) Master problem value.

Table 5.1: Iteration comparison of CBD and HQCBD over three different cases.
Case Set up # of Variables Itr. of CBD Itr. of HQCBD
1 {7, 1, 3} 63 32 31
2 {7, 2, 2} 126 55 45
3 {9, 2, 3} 198 91 89

This result confirms that our proposed algorithm is mathematically consistent with the classical Benders' decomposition algorithm. In addition, Fig. 5.4(d) shows the trend of the master problem value of Case 2 calculated by (5.16) compared with the solution of CBD. We can see that the value of the master problem keeps increasing until it converges. Specifically, the master problem value stays static in the first few rounds, since only an unbounded ray is found in the subproblem and a feasibility cut is added to the master problem. As we run more iterations, optimality cuts are found and added to the master problem. Once the difference between the upper bound and lower bound reaches a threshold, the problem is solved. The solution from HQCBD is consistent with the one from CBD.
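The bound dynamics described above follow the generic Benders' loop, which can be sketched as follows (a minimal Python sketch; solve_master and solve_subproblem are hypothetical stand-ins for the quantum/classical master solver and the classical subproblem solver, respectively):

```python
def benders(solve_master, solve_subproblem, tol=1e-4, max_rounds=100):
    """Generic Benders' decomposition loop (sketch).

    solve_master(cuts)  -> (binary decisions x, lower bound lb)
    solve_subproblem(x) -> ("optimal",   objective value, optimality cut)
                        or ("unbounded", None,            feasibility cut)
    """
    cuts, ub = [], float("inf")
    for _ in range(max_rounds):
        x, lb = solve_master(cuts)                # QPU in HQCBD, LP solver in CBD
        status, value, cut = solve_subproblem(x)  # always on the classical CPU
        if status == "unbounded":
            cuts.append(cut)                      # feasibility cut: lb stays static
        else:
            ub = min(ub, value)                   # optimality cut tightens the gap
            cuts.append(cut)
        if ub - lb <= tol:                        # bounds meet: problem solved
            break
    return x, lb, ub
```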

[Figure: (a) per-round solver accessing time (ms) of the local solver in CBD versus the QPU solver in HQCBD; (b) gain/advancement (%) of MBD over CBD for Cases 1–3 versus the number of cuts.]
Figure 5.5: Comparison of the real solver accessing time and gains of MBD over CBD in different cases.

Table 5.2: Solver accessing time (ms) comparison of CBD and HQCBD.
Case    CBD Max / Min     CBD Avg / Std      HQCBD Max / Min    HQCBD Avg / Std
1       190.47 / 6.71     117.14 / 50.12     32.10 / 15.93      31.49 / 2.79
2       235.29 / 9.11     129.56 / 50.04     32.11 / 15.92      24.18 / 7.98
3       395.48 / 14.45    120.25 / 63.19     32.11 / 16.01      25.53 / 7.85

Table 5.1 further provides a detailed comparison between CBD and HQCBD in terms of the number of iterations used to solve the problem. We can see that HQCBD takes fewer iterations to converge to the optimal solution than CBD (for example, for Case 2, the improvement in iterations is around 18%).
Moreover, we compare the real solver accessing time (i.e., the computation time of the solvers) of CBD and HQCBD in Table 5.2 and plot the detailed accessing time of Case 2 in Fig. 5.5. The solver accessing time is the real accessing time of the QPU solver and the local solver without considering other overheads, such as variable setting time, parameter transmission time, and so on. As shown in Table 5.2, the minimal accessing time of CBD is lower than that of HQCBD. However, the maximal and average accessing times as well as the standard deviation of CBD are significantly higher than those of HQCBD. For example, for Case 2, the mean accessing time of HQCBD is 81% less than that of CBD, and more significantly, the standard deviation of the accessing time of HQCBD is 84% less than that of CBD. We also confirm via Fig. 5.5(a) that the solver accessing time of CBD varies significantly from round to round, while the solver accessing time of HQCBD remains stable in each round and is even smaller than that of CBD.

[Figure: (a)–(c) master problem value versus rounds for CBD and MBD with σ = 1, 3, 5 in Cases 1–3; (d) upper/lower bound convergence of MBD (σ = 5) versus CBD in Case 2.]
Figure 5.6: Performance of MBD: its convergence.

Table 5.3: Iteration of CBD and MBD with different σ.
Case    # of Binary var.    Itr. of CBD    Itr. of MBD (σ = 1 / 3 / 5)
1       63                  32             31 / 29 / 24
2       126                 55             45 / 44 / 29
3       198                 91             89 / 36 / 27

This finding demonstrates the efficiency and robustness of leveraging the hybrid quantum-classical technique to solve the optimization problem, in terms of both the number of convergence iterations and the solver accessing time.

Performance of MBD

We now evaluate the efficiency of our proposed MBD algorithm. Similarly, we consider three different cases with different numbers of servers, FL models, and workers. We study the impact of the number of cuts σ used in MBD, selecting its value from {1, 3, 5}. Recall that when σ = 1, MBD reduces to our standard HQCBD. Table 5.3

[Figure: total costs of RAND, TWSO, and HQCBD as (a) the number of servers varies from 7 to 11 and (b) the number of workers varies from 2 to 6.]
Figure 5.7: Performance comparison with existing methods. (a) Impact of server number; (b) Impact of worker number.

and Fig. 5.6 show the results of multiple cuts and the convergence comparison with CBD. In Figs. 5.6(a)-(c), MBD-1 is our proposed HQCBD algorithm where only a single cut is added to the master problem, while MBD-3 or MBD-5 means 3 or 5 cuts are added to the master problem. We can see that MBD-1 (HQCBD) converges faster than CBD, and with more cuts (larger σ), the convergence speed of MBD-σ becomes even faster. Table 5.3 lists the detailed comparison between CBD and MBD for the different cases. Fig. 5.6(d) further shows the detailed upper and lower bound convergence comparison between our proposed MBD with σ = 5 and CBD in Case 2. We can see that our proposed method uses fewer rounds (29) to converge to the optimal value compared with the classical one (55).
We also plot the gain or advancement of MBD over CBD in terms of iteration reduction for different numbers of cuts in Fig. 5.5(b). Different numbers of cuts achieve different positive gains in different cases. The largest improvement is up to 70.3% for Case 3 with σ = 5, i.e., a reduction from 91 to 27 iterations ((91 − 27)/91 ≈ 70.3%). This further demonstrates the efficiency of both our proposed algorithms, HQCBD and MBD.
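To illustrate how MBD forms its cuts, the sketch below keeps the σ lowest-energy samples returned by one annealer call and generates one cut from each (a minimal Python sketch assuming a dimod-style SampleSet; make_cut is a hypothetical helper that runs the classical subproblem for one candidate master solution):

```python
def multi_cut_round(sampleset, make_cut, sigma=5):
    """Build up to sigma cuts from a single annealer call (MBD sketch)."""
    cuts, seen = [], set()
    for sample in sampleset.samples():       # dimod orders samples by energy
        key = tuple(sorted(sample.items()))
        if key in seen:                      # skip duplicate candidate solutions
            continue
        seen.add(key)
        cuts.append(make_cut(dict(sample)))  # one subproblem solve per candidate
        if len(cuts) == sigma:
            break
    return cuts                              # sigma = 1 recovers standard HQCBD
```

Since the annealer already returns many samples per call, the faster convergence mainly costs extra classical subproblem solves rather than extra QPU calls.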

Comparison with Existing Methods

We now compare our proposed method HQCBD with the random method (RAND)
and a two-stage iterative optimization method (TWSO) [117] in terms of solving the
joint optimization problem.
Firstly, we focus on the necessity of the joint optimization and study the impact of different numbers of servers. We concurrently train 2 FL models with 2 workers per model, while the number of servers varies from 7 to 11. Fig. 5.7(a) shows the results. RAND has the worst performance due to its randomness. Our HQCBD algorithm achieves further improvements over our previously proposed TWSO, which demonstrates the effectiveness of the HQCBD algorithm. In addition, as the number of servers increases, the total cost of HQCBD first decreases, then increases, and then decreases again. This is because the topology may change when the number of servers varies, leading to changes in the selection decisions as well as the total cost.
Next, we investigate the impact of different numbers of FL workers on the total cost. We set the numbers of servers and FL models to 15 and 2, respectively. The number of FL workers is in the range of [2, 6]. As shown in Fig. 5.7(b), the total cost increases as the number of workers increases. This is expected, since more workers consume more resources and thus incur a higher total cost. Our proposed HQCBD still outperforms the RAND and TWSO algorithms. With more qubits available, we expect that HQCBD will have a more significant speed advantage over TWSO on large-scale optimization problems.

5.5 Literature Review


In this section, we briefly review the related works in federated learning, learning
scheduling, and hybrid quantum optimization.

5.5.1 Federated Learning


Federated learning emerges as an efficient distributed machine learning approach
to exploit distributed data and computing resources, so as to collaboratively train
machine learning models. Currently, the efforts on FL have been focused on communication and energy efficiency [64, 131, 48], convergence and adaptive control [108, 56], and resource allocation and model aggregation [58, 113, 66]. For exam-
ple, Yang et al. [131] studied the joint computation and transmission optimization
problem aiming to minimize the total energy consumption for FL over wireless com-
munication networks, then proposed an iterative algorithm to derive a near-optimal
solution. Li et al. [48] formulated a compression control problem and proposed a
convergence-guaranteed FL algorithm with flexible communication compression that
allows participants to compress their gradients to different levels before uploading
to the central server. Wang et al. [108] focused on FL training convergence and

adaptive control in edge computing without client selection. They proposed a con-
trol algorithm to determine the trade-off between local update and global parameter
aggregation so as to minimize the loss function. Both [56] and [58] considered a client-
edge-cloud hierarchical federated learning (HFL) where cloud and edge servers work
as two-tier parameter servers to aggregate the partial models from mobile clients (i.e.
FL workers). Liu et al. [56] proved the convergence of such an HFL, while Luo et
al. [58] studied a joint resource allocation and edge association problem for device
users under such HFL framework to achieve global cost minimization. Wang et al.
[113] also considered the cluster structure formation in HFL where edge servers are
clustered for model aggregation. Meng et al. [66] focused on model training using
decentralized P2P methods in edge computing. While some of these works also con-
sider learning control of FL, they either consider different FL topologies (e.g. HFL,
DFL) or optimize different objectives.

5.5.2 Client Selection and Learning Scheduling


Client selection and learning scheduling are critical problems, particularly in distributed federated learning, where communication among servers is inevitable.
Hence, client selection or client sampling has been well studied in FL recently [76, 16,
87, 63, 35, 40, 144, 8]. For example, Nishio and Yonetani [76] studied a client selection
problem in edge computing where the edge server acts as a PS and numerous mobile
clients are selected as workers. Their client selection aimed to maximize the number of
selected workers under time constraints. Cho et al. [16] presented a convergence anal-
ysis of FL with biased client selection and observed that biasing the client selection
towards clients with higher local losses increases the rate of convergence compared
to unbiased client selection. Ribero and Vikalo [87] proposed a modified FedAvg al-
gorithm for updating the global model in communication-constrained settings based
on collecting models from clients and only clients the model difference exceeds the
threshold will be sampled for global updates. Marnissi et al. [63] further designed
a client selection strategy based on the gradient norms importance to improve the
communication efficiency of FL. Similarly, Balakrishnan et al. [8] also introduced
diversity in the client selection problem by leveraging submodular maximization. Lai
et al. [40] proposed a framework to guide participant selection in FL aiming to im-

prove the training performance and indicated that clients with the greatest utility
can improve model accuracy and hasten the convergence speed. Furthermore, Jin
et al. [35] considered both the learning control of FL and the edge provisioning problem in distributed cloud-edge networks. While their work is similar to ours, they did not consider the parameter server selection problem, and the remote cloud center always plays the role of the PS in their scenario. In addition, none of the aforementioned works takes the concurrent training of multiple FL models into account, which significantly affects the total training performance of all FL models.

5.5.3 Hybrid Quantum Optimization


Quantum computing (QC) [73, 83] has been proven superior in solving many challenging, computationally intensive problems [18, 24, 96, 77, 89, 5]. However, the application of QC is limited by the current state of quantum computers (such as availability or cost). To address this, hybrid quantum-classical computing frameworks have been developed for solving complex optimization problems where both quantum computers and classical computers are used.
Such hybrid quantum optimization has been newly applied in different areas in-
cluding machine learning, mobile computing, network communication, task schedul-
ing, and classification [102, 3, 80, 142, 2, 21]. For instance, Tran et al. [102] first
proposed a hybrid quantum-classical approach to solve the complete tree search prob-
lem. They decomposed the original problem into a master problem and subproblems, where both the master problem and the subproblems were solved by the quantum annealer and the global search tree was maintained by the classical computer. Ajagekar et
al. [2] proposed two hybrid QC-based optimization techniques for solving large-scale
mixed-integer linear programming (MILP) and mixed-integer fractional programming
(MIFP) scheduling problems. Similarly, both [142] and [21] introduced a hybrid
quantum-classical algorithm by leveraging a different decomposition technique (Ben-
ders’ Decomposition (BD)) to solve the MILP optimization problem. Paterakis [80]
also provided a hybrid quantum-classical optimization algorithm for unit commit-
ment problems and further introduced a method for employing various cut selection
criteria in order to control the size of the master problem. Inspired by the aforemen-
tioned works, we apply the hybrid quantum-classical framework proposed by [142, 21]

to tackle a specific real-world optimization problem that jointly optimizes the client selection and learning schedule in multi-model FL. For this particular problem, we propose a distinct solving process where the binary master problem is solved by the quantum annealer while the subproblems with continuous variables are addressed by the classical computer. Different from [142, 21], we also consider a multiple-cuts strategy to hasten the convergence speed.
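As a minimal sketch of this division of labor, the binary master problem can be encoded as a QUBO and handed to a sampler through the open-source dimod package (the coefficients below are toy values; on real hardware, dimod.ExactSolver would be replaced by a D-Wave QPU sampler such as EmbeddingComposite(DWaveSampler())):

```python
import dimod

# Toy QUBO for a 3-variable binary master problem: the objective and
# penalty-encoded constraints are folded into linear/quadratic terms.
linear = {"x0": -1.0, "x1": -2.0, "x2": -1.5}
quadratic = {("x0", "x1"): 2.0, ("x1", "x2"): 1.0}
bqm = dimod.BinaryQuadraticModel(linear, quadratic, 0.0, dimod.BINARY)

# ExactSolver enumerates all assignments (fine at toy sizes); an annealer
# instead returns many low-energy samples, which is what MBD exploits.
sampleset = dimod.ExactSolver().sample(bqm)
best = sampleset.first
print(best.sample, best.energy)
```

The continuous subproblems, in contrast, remain ordinary linear programs and are solved with a classical solver such as Gurobi or SciPy.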

5.6 Chapter Summary


In this chapter, a joint participant selection and learning scheduling problem for multi-model FL has been studied. Motivated by the powerful parallel computing capabilities of quantum computers, we proposed a quantum-assisted HQCBD algorithm that employs the complementary strengths of classical optimization and quantum annealing to optimally select participants (both the PS and FL workers) and determine the learning schedule so as to minimize the total cost of all FL models. To accelerate the convergence of our proposed algorithm, we further introduced a multiple-cuts version of HQCBD (MBD) to hasten the solving process. Extensive simulations on the D-Wave quantum annealing machine demonstrated the efficiency and robustness of our proposed HQCBD and MBD algorithms, which not only achieved the same result as the classical algorithm but also took far fewer iterations (up to 70.3% improvement) and less accessing time (up to 81% reduction) to obtain the desired solution, even at relatively small scales. With the development of robust quantum computers with more qubits, we believe the proposed HQCBD-based method will have great applications in the joint learning scheduling of distributed machine learning in the near future.

CHAPTER 6

DISSERTATION CONCLUSION

In this dissertation, we presented an in-depth study on joint resource management and task scheduling problems in mobile edge computing. Motivated by the heterogeneity of edge elements, including edge servers, mobile users, data resources, and computing tasks, the key challenge is how to effectively manage resources (e.g., data, services) and schedule tasks (e.g., ML/FL tasks) in the edge clouds so as to meet the QoS of mobile users or maximize the platform's utility.
Targeting these challenges, we first proposed a popularity-based data placement strategy in edge computing with the aim of reducing the average forwarding path length of data. We adopted a virtual-space-based placement method with greedy-routing-based retrieval, but took data popularity into consideration when generating the coordinates of data items. We carefully designed our mapping strategy so that a popular data item is placed closer to the network center in the virtual plane. The placement of data is then purely based on the distance between the data item and the edge server in the virtual plane. To address the storage limits at servers and balance the load among edge servers, we further proposed several placement strategies which either offload data items to other servers when the assigned server is overloaded or place multiple replicas of the same data item to reduce the assigned load of servers. In both cases, we took data popularity into consideration when designing the offloading and replication strategies.
Next, we jointly studied the resource placement and task dispatching problems in mobile edge computing with the aim of maximizing the total utility of performed tasks. We formulated the problem as a joint optimization problem under the storage, CPU, and memory constraints and took the status of edge servers into account. To tackle the network dynamics across different timescales, we proposed two alternative approaches: a two-stage optimization method and a deep reinforcement learning (DRL) method. The two-stage optimization method decomposes the joint optimization problem into two sub-problems (resource placement and task dispatching) and then solves them respectively and iteratively. To handle the dynamics in the edge cloud environment and the complexity of the joint optimization process in our cases, we also leveraged reinforcement learning (RL) techniques to tackle our joint optimization problem.
Last but not least, we considered a multi-model federated edge learning setting where multiple FEL models are trained in the edge network and edge servers can act as either parameter servers or workers of these FEL models. We formulated a joint participant selection and learning scheduling problem, which is a non-linear mixed-integer program, aiming to minimize the total cost of all FEL models while satisfying the desired convergence rate of the trained FEL models and the constrained edge resources. We then designed several algorithms by decoupling the original problem into two or three sub-problems that can be solved respectively and iteratively. We also extended our work to other training topologies (e.g., DFL, HFL) and proposed several heuristic algorithms to solve the corresponding optimization problems. We further proposed a novel Hybrid Quantum-Classical Benders' Decomposition (HQCBD) algorithm to tackle the joint participant selection and learning scheduling problem. By combining quantum computing and classical optimization techniques, our HQCBD algorithm can quickly converge to the desired solution, just like the classical BD algorithm, but with far fewer iterations and at much faster speeds. We also presented a multiple-cuts version of HQCBD (MBD) to accelerate convergence by forming multiple cuts in each round using multiple outputs from the quantum annealer. MBD can achieve varying levels of performance improvement by selecting different numbers of cuts.

6.1 Future Research


As discussed in this dissertation, I am interested in developing heuristic optimization algorithms for resource management and task scheduling problems that are motivated by mobile edge computing scenarios, yield new realistic insights, and demonstrate tangible practical impacts. While the existing results are encouraging, they also raise open questions. In the following, I outline a few promising problems I intend to pursue in this direction.
Federated Reinforcement Learning in Vertical Perspectives. Despite the excellent performance that RL and DRL have achieved in many areas, they still face several important technical and non-technical challenges in solving real-world problems. The successful application of FL in supervised learning tasks arouses interest in exploiting similar ideas in RL, i.e., FRL. FRL not only provides the experience for agents to learn to make good decisions in an unknown environment but also ensures that the data privately collected during the agents' exploration does not have to be shared with others. However, most current works on FRL focus on horizontal federated reinforcement learning (HFRL), in which the agents may be distributed geographically, but they face similar decision-making tasks and have very little interaction with each other in the observed environments. Hence, I am interested in vertical federated reinforcement learning (VFRL), which applies the methodology of VFL to RL and is more realistic for many real-world scenarios. In vertical federated learning (VFL), the samples of multiple data sets have different feature spaces, but these samples may belong to the same groups or common users. The training data of each participant are divided vertically according to their features. More general and accurate models can be generated by building heterogeneous feature spaces without releasing private information. VFRL is suitable for Partially Observable Markov Decision Process (POMDP) scenarios where different RL agents are in the same environment but have different interactions with the environment. Compared with HFRL, there are currently few works on VFRL. The drawbacks of current VFRL works are the small feature space of states and the limited training data. In addition, they only contain two agents, and the structure of the aggregated neural network model is relatively simple. Hence, a promising first step is to implement a more flexible and general VFRL framework and verify its effectiveness.
Hybrid Quantum-Classical Techniques for Satellite Edge Intelligence. With the acceleration of the beyond-6G wireless communication process, satellite communication technologies and high-altitude platform (HAP) or unmanned aerial vehicle (UAV) communication technologies have attracted wide attention for their reduced vulnerability to natural disasters and physical attacks. As a technology that has been proven and deployed for a long time, satellite communication stands out for its capacious service coverage capabilities. Recently, the integration of satellites, terrestrial cellular networks, and mobile edge computing has become a general trend for future networks. However, there are still several challenges in the combination of edge computing and satellites: 1) the limited visibility time of satellites; 2) terrestrial edge and cloud infrastructures are generally fixed, but satellites are moving assets; 3) task assignments and satellite edge state have to migrate across multiple neighboring satellites if a task moves beyond the coverage; 4) satellite resources need to be shared among multiple tasks rather than dedicated to a specific edge computing task. Therefore, I am interested in developing heuristic approaches that leverage hybrid quantum-classical techniques to implement resource allocation across satellite-terrestrial networks, server assignment for task execution, a load-aware offloading process, as well as satellite network instability detection due to continuous route changes or task assignments. In addition, to facilitate the efficiency of hybrid quantum-classical techniques, I am also interested in the optimization and improvement of quantum computing itself.
Resource Management and Scheduling Optimization in Quantum Networks. Quantum networks use the quantum properties of photons to encode information. For instance, photons polarized in one direction (for example, in the direction that would allow them to pass through polarized sunglasses) are associated with the value one; photons polarized in the opposite direction (so they do not pass through the sunglasses) are associated with the value zero. Researchers are developing quantum communication protocols to formalize these associations, allowing the quantum state of photons to carry information from sender to receiver through a quantum network. Hence, quantum resource management is an emerging problem, and I plan to propose optimization algorithms to manage quantum resources and schedule quantum entanglement in quantum networks.
Collaborative Intelligent Systems for Edge AIoT and AR/VR. Modern information or network systems do not serve individual users in a vacuum but rather must provide service simultaneously for a large number of users. Effective and broadly applicable learning approaches should have both the flexibility to model and personalize to individual users and the ability to intelligently balance the exploration/exploitation trade-off for entire populations of users. In addition to the aforementioned future directions, I am also interested in developing collaborative intelligent systems for edge AIoT, AR/VR, and vehicular ad-hoc networks. Such a system aims to provide personalized data management and privacy protection services, as well as integrated resource allocation, task assignment, and self-learning modules. The intelligent collaborative system can suit many scenarios, such as smart cities, smart healthcare, smart grids, intelligent robots, and advanced manufacturing.

BIBLIOGRAPHY

[1] Hassan I Abdalla. An efficient approach for data placement in distributed sys-
tems. In 2011 Fifth FTRA international conference on multimedia and ubiqui-
tous engineering, pages 297–301. IEEE, 2011.

[2] Akshay Ajagekar, Kumail Al Hamoud, and Fengqi You. Hybrid classical-
quantum optimization techniques for solving mixed-integer programming prob-
lems in production scheduling. IEEE Transactions on Quantum Engineering,
3:1–16, Jun. 2022.

[3] Akshay Ajagekar, Travis Humble, and Fengqi You. Quantum computing based
hybrid solution strategies for large-scale discrete-continuous optimization prob-
lems. Computers & Chemical Engineering, 132:106630, Jan. 2020.

[4] Mohammad H Al-Shayeji, Sam Rajesh, Manal Alsarraf, and Reem Alsuwaid.
A comparative study on replica placement algorithms for content delivery net-
works. In 2010 Second International Conference on Advances in Computing,
Control, and Telecommunication Technologies, pages 140–142. IEEE, 2010.

[5] Dong An and Lin Lin. Quantum linear system solver based on time-optimal adi-
abatic quantum computing and quantum approximate optimization algorithm.
ACM Transactions on Quantum Computing, 3(2):1–28, Jun. 2022.

[6] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami
Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell,
et al. Quantum supremacy using a programmable superconducting processor.
Nature, 574(7779):505–510, Oct. 2019.

[7] Cheikh Saliou Mbacke Babou, Doudou Fall, Shigeru Kashihara, Yuzo Taenaka,
Monowar H Bhuyan, Ibrahima Niang, and Youki Kadobayashi. Hierarchical
load balancing and clustering technique for home edge computing. IEEE Access,
8:127593–127607, 2020.

[8] Ravikumar Balakrishnan, Tian Li, Tianyi Zhou, Nageen Himayat, Virginia
Smith, and Jeff Bilmes. Diverse client selection for federated learning via sub-

modular maximization. In International Conference on Learning Representa-
tions (ICLR), Virtual, Jan. 2022.

[9] Logan Beal, Daniel Hill, R Martin, and John Hedengren. Gekko optimization
suite. Processes, 6(8):106, 2018.

[10] Ran Bi, Qian Liu, Jiankang Ren, and Guozhen Tan. Utility aware offloading
for mobile-edge computing. Tsinghua Science and Technology, 26(2):239–250,
2020.

[11] Suzhi Bi, Liang Huang, and Ying-Jun Angela Zhang. Joint optimization of
service caching placement and computation offloading in mobile edge computing
systems. IEEE Transactions on Wireless Communications, 19(7):4947–4963,
2020.

[12] Martin Breitbach, Dominik Schäfer, Janick Edinger, and Christian Becker.
Context-aware data and task placement in edge computing environments. In
2019 IEEE International Conference on Pervasive Computing and Communi-
cations (PerCom, pages 1–10. IEEE, 2019.

[13] André Brinkmann, Kay Salzwedel, and Christian Scheideler. Efficient, dis-
tributed data placement strategies for storage area networks. In Proceedings of
the twelfth annual ACM symposium on Parallel algorithms and architectures,
pages 119–128, 2000.

[14] Mingzhe Chen, Zhaohui Yang, Walid Saad, Changchuan Yin, H Vincent Poor,
and Shuguang Cui. A joint learning and communications framework for feder-
ated learning over wireless networks. IEEE Transactions on Wireless Commu-
nications, 20(1):269–283, 2020.

[15] Xianfu Chen, Honggang Zhang, Celimuge Wu, Shiwen Mao, Yusheng Ji, and
Medhi Bennis. Optimized computation offloading performance in virtual edge
computing systems via deep reinforcement learning. IEEE Internet of Things
Journal, 6(3):4005–4018, 2018.

[16] Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Client selection in federated

learning: Convergence analysis and power-of-choice selection strategies. arXiv
preprint arXiv:2010.01243, 2020.

[17] P. Chundi, D. J. Rosenkrantz, and S. S. Ravi. Deferred updates and data placement in distributed databases. In Proceedings of the Twelfth International Conference on Data Engineering, pages 469–476, 1996.

[18] David Deutsch and Richard Jozsa. Rapid solution of problems by quantum
computation. Proceedings: Mathematical and Physical Sciences, 439(1907):553–
558, Dec. 1992.

[19] Maciej Drwal and Jerzy Józefczyk. Decentralized approximation algorithm for
data placement problem in content delivery networks. In Doctoral Conference
on Computing, Electrical and Industrial Systems, pages 85–92. Springer, 2012.

[20] Nima Eshraghi and Ben Liang. Joint offloading decision and resource allocation
with uncertain task computing requirement. In IEEE INFOCOM 2019-IEEE
Conference on Computer Communications, pages 1414–1422. IEEE, 2019.

[21] Lei Fan and Zhu Han. Hybrid quantum-classical computing for future network
optimization. IEEE Network, 36(5):72–76, Nov. 2022.

[22] Vajiheh Farhadi, Fidan Mehmeti, Ting He, Tom La Porta, Hana Khamfroush,
Shiqiang Wang, and Kevin S Chan. Service placement and request scheduling
for data-intensive applications in edge clouds. In IEEE INFOCOM 2019-IEEE
Conference on Computer Communications, pages 1279–1287. IEEE, 2019.

[23] Fred Glover, Gary Kochenberger, Rick Hennig, and Yu Du. Quantum bridge
analytics i: a tutorial on formulating and using qubo models. 4OR-Q J Oper
Res, 17:335–371, Nov. 2019.

[24] Lov K. Grover. A fast quantum mechanical algorithm for database search.
In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of
Computing, STOC ’96, New York, NY, May 1996.

[25] Hongzhi Guo, Jiajia Liu, and Jianfeng Lv. Toward intelligent task offloading
at the edge. IEEE Network, 34(2):128–134, 2019.

[26] Wei Guo and Xinjun Wang. A data placement strategy based on genetic algo-
rithm in cloud computing platform. In 2013 10th Web Information System and
Application Conference, pages 369–372. IEEE, 2013.

[27] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, Jan. 2023.

[28] Chamseddine Hamdeni, Tarek Hamrouni, and F Ben Charrada. Data popularity
measurements in distributed systems: Survey and design directions. Journal of
Network and Computer Applications, 72:150–161, 2016.

[29] Liang Huang, Suzhi Bi, and Ying-Jun Angela Zhang. Deep reinforcement learn-
ing for online computation offloading in wireless powered mobile-edge com-
puting networks. IEEE Transactions on Mobile Computing, 19(11):2581–2593,
2019.

[30] Yaodong Huang, Xintong Song, Fan Ye, Yuanyuan Yang, and Xiaoming Li. Fair
and efficient caching algorithms and strategies for peer data sharing in perva-
sive edge computing environments. IEEE Transactions on Mobile Computing,
19(4):852–864, 2019.

[31] Yaodong Huang, Jiarui Zhang, Jun Duan, Bin Xiao, Fan Ye, and Yuanyuan
Yang. Resource allocation and consensus on edge blockchain in pervasive edge
computing environments. In 2019 IEEE 39th International Conference on Dis-
tributed Computing Systems (ICDCS), pages 1476–1486. IEEE, 2019.

[32] Tomasz Janaszka, Dariusz Bursztynowski, and Mateusz Dzida. On popularity-based load balancing in content networks. In 2012 24th International Teletraffic Congress (ITC 24), pages 1–8. IEEE, 2012.

[33] S. Ji, W. Jiang, A. Walid, and X. Li. Dynamic sampling and selective mask-
ing for communication-efficient federated learning. IEEE Intelligent Systems,
37(02):27–34, Mar. 2022.

[34] Shaoxiong Ji, Wenqi Jiang, Anwar Walid, and Xue Li. Dynamic sampling and
selective masking for communication-efficient federated learning. arXiv preprint
arXiv:2003.09603, 2020.

[35] Yibo Jin, Lei Jiao, Zhuzhong Qian, Sheng Zhang, and Sanglu Lu. Learning for
learning: Predictive online control of federated learning with edge provisioning.
In IEEE INFOCOM 2021-IEEE Conference on Computer Communications,
pages 1–10. IEEE, 2021.

[36] Yibo Jin, Lei Jiao, Zhuzhong Qian, Sheng Zhang, Sanglu Lu, and Xiaoliang
Wang. Resource-efficient and convergence-preserving online participant selec-
tion in federated learning. In IEEE International Conference on Distributed
Computing Systems (ICDCS), 2020.

[37] Junghoon Kim, Taejoon Kim, Morteza Hashemi, Christopher G Brinton, and
David J Love. Joint optimization of signal design and resource allocation in
wireless D2D edge computing. In IEEE INFOCOM 2020-IEEE Conference on
Computer Communications, pages 2086–2095. IEEE, 2020.

[38] Simon Knight, Hung X Nguyen, Nickolas Falkner, Rhys Bowden, and Matthew
Roughan. The internet topology zoo. IEEE Journal on Selected Areas in Com-
munications, 29(9):1765–1775, Oct. 2011.

[39] Alex Krizhevsky. Learning multiple layers of features from tiny images. Tech-
nical report, University of Toronto, Apr. 2009.

[40] Fan Lai, Xiangfeng Zhu, Harsha V Madhyastha, and Mosharaf Chowdhury.
Oort: Efficient federated learning via guided participant selection. In Pro-
ceedings of the 15th USENIX Symposium on Operating Systems Design and
Implementation (OSDI), Virtual, Jul. 2021.

[41] Phu Lai, Qiang He, Mohamed Abdelrazek, Feifei Chen, John Hosking, John
Grundy, and Yun Yang. Optimal edge user allocation in edge computing with
variable sized vector bin packing. In International Conference on Service-
Oriented Computing, pages 230–245. Springer, 2018.

[42] S. S. Lam and C. Qian. Geographic routing in d-dimensional spaces with guaranteed delivery and low stretch. IEEE/ACM Transactions on Networking, 21(2):663–677, 2013.

[43] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-
based learning applied to document recognition. Proceedings of the IEEE,
86(11):2278–2324, 1998.

[44] Lin-Wen Lee, Peter Scheuermann, and Radek Vingralek. File assignment in
parallel I/O systems with minimal variance of service time. IEEE Transactions
on Computers, 49(2):127–140, 2000.

[45] Chunlin Li, Jingpan Bai, and JianHang Tang. Joint optimization of data place-
ment and scheduling for improving user experience in edge computing. Journal
of Parallel and Distributed Computing, 125:93–105, 2019.

[46] Ji Li, Hui Gao, Tiejun Lv, and Yueming Lu. Deep reinforcement learning based
computation offloading and resource allocation for mec. In 2018 IEEE Wireless
Communications and Networking Conference (WCNC), pages 1–6. IEEE, 2018.

[47] Jun Li, Hao Wu, Bin Liu, Jianyuan Lu, Yi Wang, Xin Wang, YanYong Zhang,
and Lijun Dong. Popularity-driven coordinated caching in named data net-
working. In 2012 ACM/IEEE Symposium on Architectures for Networking and
Communications Systems (ANCS), pages 15–26. IEEE, 2012.

[48] Liang Li, Dian Shi, Ronghui Hou, Hui Li, Miao Pan, and Zhu Han. To talk
or to work: Flexible communication compression for energy efficient federated
learning over heterogeneous mobile edge devices. In IEEE INFOCOM 2021-
IEEE Conference on Computer Communications, pages 1–10. IEEE, 2021.

[49] Qiang Li, Kun Wang, Suwei Wei, Xuefeng Han, Lili Xu, and Min Gao. A
data placement strategy based on clustering and consistent hashing algorithm
in cloud computing. In 9th International Conference on Communications and
Networking in China, pages 478–483. IEEE, 2014.

[50] Ting Li, Zhijin Qiu, Lijuan Cao, Dazhao Cheng, Weichao Wang, Xinghua Shi,
and Yu Wang. Privacy-preserving participant grouping for mobile social sensing
over edge clouds. IEEE Transactions on Network Science and Engineering,
8(2):865–880, 2020.

[51] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On
the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189,
2019.

[52] Youqi Li, Fan Li, Lixing Chen, Liehuang Zhu, Pan Zhou, and Yu Wang. Power
of redundancy: Surplus client scheduling for federated learning against user
uncertainties. IEEE Transactions on Mobile Computing, 2022.

[53] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom
Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with
deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[54] Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-
Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning
in mobile edge networks: A comprehensive survey. IEEE Communications
Surveys & Tutorials, 22(3):2031–2063, 2020.

[55] Bing Lin, Fangning Zhu, Jianshan Zhang, Jiaqing Chen, Xing Chen, Naixue N
Xiong, and Jaime Lloret Mauri. A time-driven data placement strategy for
a scientific workflow combining edge computing and cloud computing. IEEE
Transactions on Industrial Informatics, 15(7):4254–4265, 2019.

[56] Lumin Liu, Jun Zhang, SH Song, and Khaled B Letaief. Client-edge-cloud hier-
archical federated learning. In ICC 2020-2020 IEEE International Conference
on Communications (ICC), pages 1–6. IEEE, 2020.

[57] Yang Liu, Tong Feng, Mugen Peng, Jianfeng Guan, and Yu Wang. Dream:
Online control mechanisms for data aggregation error minimization in privacy-
preserving crowdsensing. IEEE Transactions on dependable and secure comput-
ing, 19(2):1266–1279, 2020.

[58] Siqi Luo, Xu Chen, Qiong Wu, Zhi Zhou, and Shuai Yu. HFEL: Joint edge
association and resource allocation for cost-efficient hierarchical federated edge
learning. IEEE Transactions on Wireless Communications, 19(10):6535–6548,
2020.

[59] Qin Lv, Pei Cao, Edith Cohen, Kai Li, and Scott Shenker. Search and replica-
tion in unstructured peer-to-peer networks. In Proceedings of the 16th interna-
tional conference on Supercomputing, pages 84–95, 2002.

[60] Chenxin Ma, Jakub Konečnỳ, Martin Jaggi, Virginia Smith, Michael I Jordan,
Peter Richtárik, and Martin Takáč. Distributed optimization with arbitrary
local solvers. Optimization Methods and Software, 32(4):813–848, 2017.

[61] Xiao Ma, Ao Zhou, Shan Zhang, and Shangguang Wang. Cooperative service
caching and workload scheduling in mobile edge computing. In IEEE INFO-
COM 2020-IEEE Conference on Computer Communications, pages 2076–2085.
IEEE, 2020.

[62] John MacCormick, Nicholas Murphy, Venugopalan Ramasubramanian, Udi Wieder, Junfeng Yang, and Lidong Zhou. Kinesis: A new approach to replica placement in distributed storage systems. ACM Transactions On Storage (TOS), 4(4):1–28, 2009.

[63] Ouiame Marnissi, Hajar El Hammouti, and El Houcine Bergou. Client se-
lection in federated learning based on gradients importance. arXiv preprint
arXiv:2111.11204, Nov. 2021.

[64] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and
Blaise Aguera y Arcas. Communication-efficient learning of deep networks from
decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282.
PMLR, 2017.

[65] Qianyu Meng, Kun Wang, Xiaoming He, and Minyi Guo. Qoe-driven big data
management in pervasive edge computing environment. Big Data Mining and
Analytics, 1(3):222–233, 2018.

[66] Zeyu Meng, Hongli Xu, Min Chen, Yang Xu, Yangming Zhao, and Chunming
Qiao. Learning-driven decentralized machine learning in resource-constrained
wireless edge computing. In IEEE INFOCOM 2021-IEEE Conference on Com-
puter Communications, pages 1–10. IEEE, 2021.

[67] Erfan Meskar and Ben Liang. Fair multi-resource allocation in mobile edge
computing with multiple access points. In Proceedings of the Twenty-First
International Symposium on Theory, Algorithmic Foundations, and Protocol
Design for Mobile Networks and Mobile Computing, pages 11–20, 2020.

[68] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim-
othy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asyn-
chronous methods for deep reinforcement learning. In International conference
on machine learning, pages 1928–1937, 2016.

[69] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Ve-
ness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland,
Georg Ostrovski, et al. Human-level control through deep reinforcement learn-
ing. nature, 518(7540):529–533, 2015.

[70] Han Mongnam, Lee Youngseok, Moon Sue B., Jang Keon, and
Lee Dooyoung. Crawdad dataset kaist/wibro (v. 2008-06-04), 2020.
https://crawdad.org/kaist/wibro/20080604.

[71] Nuno Moniz and Luís Torgo. Multi-source social feedback of online news feeds. CoRR, https://arxiv.org/abs/1801.07055, 2018.

[72] Samrat Nath and Jingxian Wu. Deep reinforcement learning for dynamic com-
putation offloading and resource allocation in cache-assisted mobile edge com-
puting systems. Intelligent and Converged Networks, 1(2):181–198, 2020.

[73] National Academies of Sciences, Engineering, and Medicine. Quantum computing: progress and prospects. National Academies Press, 2019.

[74] Minh NH Nguyen, Nguyen H Tran, Yan Kyaw Tun, Zhu Han, and Choong Seon
Hong. Toward multiple federated learning services resource sharing in mobile
edge networks. arXiv preprint arXiv:2011.12469, 2020.

[75] Zhaolong Ning, Peiran Dong, Xiaojie Wang, Joel JPC Rodrigues, and Feng
Xia. Deep reinforcement learning for vehicular edge computing: An intelligent
offloading system. ACM Transactions on Intelligent Systems and Technology
(TIST), 10(6):1–24, 2019.

[76] Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with
heterogeneous resources in mobile edge. In ICC 2019-2019 IEEE International
Conference on Communications (ICC), pages 1–7. IEEE, 2019.

[77] Siyuan Niu and Aida Todri-Sanial. Effects of dynamical decoupling and pulse-
level optimizations on IBM quantum computers. IEEE Transactions on Quan-
tum Engineering, 3:1–10, Aug. 2022.

[78] Tao Ouyang, Rui Li, Xu Chen, Zhi Zhou, and Xin Tang. Adaptive user-managed
service placement for mobile edge computing: An online learning approach. In
IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pages
1468–1476. IEEE, 2019.

[79] Stephen Pasteris, Shiqiang Wang, Mark Herbster, and Ting He. Service place-
ment with provable guarantees in heterogeneous edge computing systems. In
IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pages
514–522. IEEE, 2019.

[80] Nikolaos G Paterakis. Hybrid quantum-classical multi-cut benders approach with a power system application. arXiv preprint arXiv:2112.05643, Dec. 2021.

[81] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[82] Konstantinos Poularakis, Jaime Llorca, Antonia M Tulino, Ian Taylor, and
Leandros Tassiulas. Joint service placement and request routing in multi-cell
mobile edge computing networks. In IEEE INFOCOM 2019-IEEE Conference
on Computer Communications, pages 10–18. IEEE, 2019.

[83] John Preskill. Quantum computing in the NISQ era and beyond. Quantum,
2:79, Aug. 2018.

[84] G. M. Shafiqur Rahman, Tian Dang, and Manzoor Ahmed. Deep reinforcement
learning based computation offloading and resource allocation for low-latency

fog radio access networks. Intelligent and Converged Networks, 1(3):243–257,
2020.

[85] Ragheb Rahmaniani, Teodor Gabriel Crainic, Michel Gendreau, and Walter
Rei. The benders decomposition algorithm: A literature review. European
Journal of Operational Research, 259(3):801–817, Jun 2017.

[86] Google Research. Google cluster data (clusterdata 2011 traces), 2011.
https://github.com/google/cluster-data.

[87] Monica Ribero and Haris Vikalo. Communication-efficient federated learning via optimal client sampling. arXiv preprint arXiv:2007.15197, Oct. 2020.

[88] Krzysztof Rzadca, Anwitaman Datta, and Sonja Buchegger. Replica placement
in P2P storage: Complexity and game theoretic analyses. In 2010 IEEE 30th
International Conference on Distributed Computing Systems, pages 599–609.
IEEE, 2010.

[89] Özlem Salehi, Adam Glos, and Jaroslaw Adam Miszczak. Unconstrained binary
models of the travelling salesman problem variants for quantum optimization.
Quantum Information Processing, 21(2):67, Jan. 2022.

[90] Gamal Sallam and Bo Ji. Joint placement and allocation of virtual network
functions with budget and capacity constraints. In IEEE INFOCOM 2019-
IEEE Conference on Computer Communications, pages 523–531. IEEE, 2019.

[91] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek.
Robust and communication-efficient federated learning from non-iid data. IEEE
transactions on neural networks and learning systems, 31(9):3400–3413, 2019.

[92] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient dis-
tributed optimization using an approximate newton-type method. In Inter-
national conference on machine learning (ICML), Beijing, China, Jun. 2014.

[93] Yanling Shao, Chunlin Li, and Hengliang Tang. A data replica placement strat-
egy for iot workflows in collaborative edge and cloud environments. Computer
Networks, 148:46–59, 2019.

[94] Dian Shen, Junzhou Luo, Fang Dong, and Junxue Zhang. Virtco: joint coflow
scheduling and virtual machine placement in cloud data centers. Tsinghua
Science and Technology, 24(5):630–644, 2019.

[95] Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. Edge com-
puting: Vision and challenges. IEEE Internet of Things Journal, 3(5):637–646,
2016.

[96] Peter W. Shor. Polynomial-time algorithms for prime factorization and discrete
logarithms on a quantum computer. SIAM J. Comput., 26(5):1484–1509, Oct.
1997.

[97] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and
Martin Riedmiller. Deterministic policy gradient algorithms. In JMLR, 2014.

[98] D-Wave Systems. D-wave hybrid solver service: An overview, 2022. https://www.dwavesys.com/resources/white-paper/d-wave-hybrid-solver-service-an-overview/.

[99] Pulp Team. Pulp 2.3, 2020. https://coin-or.github.io/pulp/.

[100] Jing Tian, Zhi Yang, and Yafei Dai. A data placement scheme with time-
related model for P2P storages. In Seventh IEEE International Conference on
Peer-to-Peer Computing (P2P 2007), pages 151–158. IEEE, 2007.

[101] Nguyen H Tran, Wei Bao, Albert Zomaya, Minh NH Nguyen, and Choong Seon
Hong. Federated learning over wireless networks: Optimization model design
and analysis. In IEEE INFOCOM 2019-IEEE Conference on Computer Com-
munications, pages 1387–1395. IEEE, 2019.

[102] Tony Tran, Minh Do, Eleanor Rieffel, Jeremy Frank, Zhihui Wang, Bryan
O’Gorman, Davide Venturelli, and J Beck. A hybrid quantum-classical approach
to solving scheduling problems. In Proceedings of the International Symposium
on Combinatorial Search, New York, USA, Jul. 2016.

[103] Manghui Tu, Hui Ma, Liangliang Xiao, I-Ling Yen, Farokh Bastani, and Di-
anxiang Xu. Data placement in P2P data grids considering the availability,

security, access performance and load balancing. Journal of grid computing,
11(1):103–127, 2013.

[104] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning
with double Q-learning. In Proceedings of the AAAI conference on artificial
intelligence, 2016.

[105] Pauli Virtanen, Ralf Gommers, et al. SciPy 1.0: Fundamental Algorithms for
Scientific Computing in Python. Nature Methods, 17:261–272, Feb. 2020.

[106] Jiadai Wang, Lei Zhao, Jiajia Liu, and Nei Kato. Smart resource allocation
for mobile edge computing: A deep reinforcement learning approach. IEEE
Transactions on emerging topics in computing, 2019.

[107] Mingjun Wang, Jinghui Zhang, Fang Dong, and Junzhou Luo. Data placement
and task scheduling optimization for data intensive scientific workflow in mul-
tiple data centers environment. In 2014 Second International Conference on
Advanced Cloud and Big Data, pages 77–84. IEEE, 2014.

[108] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian
Makaya, Ting He, and Kevin Chan. Adaptive federated learning in resource
constrained edge computing systems. IEEE Journal on Selected Areas in Com-
munications, 37(6):1205–1221, 2019.

[109] Tao Wang, Shihong Yao, Zhengquan Xu, and Shan Jia. DCCP: an effective
data placement strategy for data-intensive computations in distributed cloud
computing systems. The Journal of Supercomputing, 72(7):2537–2564, 2016.

[110] Ying Wang, Yifan Dong, Songtao Guo, Yuanyuan Yang, and Xiaofeng Liao.
Latency-aware adaptive video summarization for mobile edge clouds. IEEE
Transactions on Multimedia, 22(5):1193–1207, 2019.

[111] Yu Wang and Xiang-Yang Li. Efficient Delaunay-based localized routing for
wireless sensor networks. Wiley International Journal of Communication Sys-
tem, 20(7):767–789, 2007.

[112] Yu Wang, Chih-Wei Yi, Minsu Huang, and Fan Li. Three dimensional greedy
routing in large-scale random wireless sensor networks. Ad Hoc Networks Jour-
nal, 11(4):1331–1344, 2013.

[113] Zhiyuan Wang, Hongli Xu, Jianchun Liu, He Huang, Chunming Qiao, and
Yangming Zhao. Resource-efficient federated learning with hierarchical aggre-
gation in edge computing. In IEEE INFOCOM 2021-IEEE Conference on Com-
puter Communications, pages 1–10. IEEE, 2021.

[114] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.

[115] Qingsong Wei, Bharadwaj Veeravalli, Bozhao Gong, Lingfang Zeng, and Dan
Feng. CDRM: A cost-effective dynamic replication management scheme for
cloud storage cluster. In 2010 IEEE international conference on cluster com-
puting, pages 188–196. IEEE, 2010.

[116] Xinliang Wei, Jiyao Liu, Xinghua Shi, and Yu Wang. Participant selection for
hierarchical federated learning in edge clouds. In IEEE International Conference
on Networking, Architecture, and Storage (NAS 2022), 2022.

[117] Xinliang Wei, Jiyao Liu, and Yu Wang. Joint participant selection and learning
scheduling for multi-model federated edge learning. In IEEE 19th International
Conference on Mobile Ad Hoc and Smart Systems (MASS), Denver, CO, Oct.
2022.

[118] Xinliang Wei, ABM Mohaimenur Rahman, and Yu Wang. Data placement
strategies for data-intensive computing over edge clouds. In 2021 IEEE Inter-
national Performance, Computing, and Communications Conference (IPCCC),
pages 1–8. IEEE, 2021.

[119] Xinliang Wei and Yu Wang. Popularity-based data placement with load bal-
ancing in edge computing. IEEE Transactions on Cloud Computing, 2021.

[120] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image
dataset for benchmarking machine learning algorithms, 2017.

[121] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image
dataset for benchmarking machine learning algorithms. ArXiv, abs/1708.07747,
Aug. 2017.

[122] Junjie Xie, Chen Qian, Deke Guo, Xin Li, Shouqian Shi, and Honghui Chen.
Efficient data placement and retrieval services in edge computing. In 2019 IEEE
39th International Conference on Distributed Computing Systems (ICDCS),
pages 1029–1039. IEEE, 2019.

[123] Junjie Xie, Chen Qian, Deke Guo, Minmei Wang, Shouqian Shi, and Honghui
Chen. Efficient indexing mechanism for unstructured data sharing systems in
edge computing. In IEEE INFOCOM 2019-IEEE Conference on Computer
Communications, pages 820–828. IEEE, 2019.

[124] Jie Xu, Lixing Chen, and Pan Zhou. Joint service caching and task offloading
for mobile edge computing in dense networks. In IEEE INFOCOM 2018-IEEE
Conference on Computer Communications, pages 207–215. IEEE, 2018.

[125] Qiang Xu, Zhengquan Xu, Tao Wang, et al. A data-placement strategy based
on genetic algorithm in cloud computing. International Journal of Intelligence
Science, 5(03):145, 2015.

[126] Zichuan Xu, Lizhen Zhou, Sid Chi-Kin Chau, Weifa Liang, Qiufen Xia, and
Pan Zhou. Collaborate or separate? distributed service caching in mobile edge
clouds. In IEEE INFOCOM 2020-IEEE Conference on Computer Communi-
cations, pages 2066–2075. IEEE, 2020.

[127] Lei Yang, Haipeng Yao, Jingjing Wang, Chunxiao Jiang, Abderrahim Bensli-
mane, and Yunjie Liu. Multi-uav-enabled load-balance mobile-edge computing
for iot networks. IEEE Internet of Things Journal, 7(8):6898–6908, 2020.

[128] Song Yang, Nan He, Fan Li, Stojan Trajanovski, Xu Chen, Yu Wang, and
Xiaoming Fu. Survivable task allocation in cloud radio access networks with
mobile edge computing. IEEE Internet of Things Journal, 8(2):1095–1108,
2020.

[129] Song Yang, Fan Li, Meng Shen, Xu Chen, Xiaoming Fu, and Yu Wang. Cloudlet
placement and task allocation in mobile edge computing. IEEE Internet of
Things Journal, 6(3):5853–5863, 2019.

[130] Song Yang, Fan Li, Stojan Trajanovski, Xu Chen, Yu Wang, and Xiaoming Fu.
Delay-aware virtual network function placement and routing in edge clouds.
IEEE Transactions on Mobile Computing, 20(2):445 – 459, 2021.

[131] Zhaohui Yang, Mingzhe Chen, Walid Saad, Choong Seon Hong, and Mohammad
Shikh-Bahaei. Energy efficient federated learning over wireless communication
networks. IEEE Transactions on Wireless Communications, 20(3):1935–1949,
2020.

[132] Wencong You, Lei Jiao, Sourav Bhattacharya, and Yuan Zhang. Dynamic
distributed edge resource provisioning via online learning across timescales. In
IEEE SECON, 2020.

[133] Dong Yuan, Yun Yang, Xiao Liu, and Jinjun Chen. A data placement strategy
in scientific cloud workflows. Future Generation Computer Systems, 26(8):1200–
1214, 2010.

[134] Chen Zhang, Hongwei Du, Qiang Ye, Chuang Liu, and He Yuan. DMRA: a
decentralized resource allocation scheme for Multi-SP mobile edge computing.
In 2019 IEEE 39th International Conference on Distributed Computing Systems
(ICDCS), pages 390–398. IEEE, 2019.

[135] Jiale Zhang, Bing Chen, Yanchao Zhao, Xiang Cheng, and Feng Hu. Data
security and privacy-preserving in edge computing paradigm: Survey and open
issues. IEEE access, 6:18209–18237, 2018.

[136] Jie Zhang, Hongzhi Guo, Jiajia Liu, and Yanning Zhang. Task offloading in
vehicular edge computing networks: A load-balancing solution. IEEE Transac-
tions on Vehicular Technology, 69(2):2092–2104, 2019.

[137] Wei Zhang, Xiao Chen, and Jianhui Jiang. A multi-objective optimization
method of initial virtual machine fault-tolerant placement for star topological

data centers of cloud systems. Tsinghua Science and Technology, 26(1):95–111,
2021.

[138] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional
networks for text classification. In NIPS, 2015.

[139] L. Zhao and J. Liu. Optimal placement of virtual machines for supporting
multiple applications in mobile edge networks. IEEE Transactions on Vehicular
Technology, 67(7):6533–6545, July 2018.

[140] L. Zhao, W. Sun, Y. Shi, and J. Liu. Optimal placement of cloudlets for access
delay minimization in sdn-based internet of things networks. IEEE Internet of
Things Journal, 5(2):1334–1344, April 2018.

[141] Qing Zhao, Congcong Xiong, Xi Zhao, Ce Yu, and Jian Xiao. A data place-
ment strategy for data-intensive scientific workflows in cloud. In 2015 15th
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing,
pages 928–934. IEEE, 2015.

[142] Zhongqi Zhao, Lei Fan, and Zhu Han. Hybrid quantum benders’ decomposition
for mixed-integer linear programming. In IEEE Wireless Communications and
Networking Conference (WCNC), Austin, TX, Apr. 2022.

[143] Han-Sen Zhong, Hui Wang, Yu-Hao Deng, Ming-Cheng Chen, Li-Chao Peng,
Yi-Han Luo, Jian Qin, Dian Wu, Xing Ding, Yi Hu, et al. Quantum computa-
tional advantage using photons. Science, 370(6523):1460–1463, Dec. 2020.

[144] Hongbin Zhu, Yong Zhou, Hua Qian, Yuanming Shi, Xu Chen, and Yang Yang.
Online client selection for asynchronous federated learning with fairness con-
sideration. IEEE Transactions on Wireless Communications, Oct. 2022. doi:
10.1109/TWC.2022.3211998.
