A Dissertation
Submitted to
the Temple University Graduate Board
in Partial Fulfillment
of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY
by
Xinliang Wei
May, 2023
ABSTRACT
CPU computer, but with fewer convergence iterations. We have also demonstrated that the hybrid quantum-classical Benders' decomposition technique has the potential to be applied to larger-scale scenarios in the near future.
Keywords: Resource Management, Edge Computing, Federated Learning, Multi-
stage Optimization, Hybrid Quantum-Classical Technique
ACKNOWLEDGEMENTS
One of the most meaningful experiences of my life has been and will always be
pursuing my doctorate at Temple University. The obstacles and challenges I faced
over the past four and a half years and eventually overcame have been very beneficial
to me. I have developed the skills necessary to be a qualified researcher throughout
this process, as well as a rigorous attitude toward research. I want to sincerely thank
everyone who has supported and helped me.
First of all, I would like to express my deepest gratitude to my Ph.D. advisor,
Prof. Yu Wang. During the past four years, he has been so patient in guiding me in
my research and taught me invaluable lessons in both doing research and handling
problems in life. Without his continuous support, I would not have been able to accomplish as much as I have. Prof. Wang has devoted his time
and efforts to advising my research, discussing popular research topics, and sharing
his insightful ideas with me. He always tried his best to help and support me
by introducing collaborators in specific areas to me so as to motivate me with new
insights. He also gave me great help and suggestions during my job interviews. I am
so fortunate to have Prof. Yu Wang as my advisor when pursuing my Ph.D. degree.
Through an introduction by Prof. Yu Wang, I had the chance to collaborate with Prof. Zhu Han and Prof. Lei Fan at the University of Houston, as well as Prof. Yuanxiong Guo at the University of Texas at San Antonio. All of them helped me a lot
in the recent research regarding the hybrid quantum-classical technique. I would like
to thank Prof. Zhu Han who guided me in exploring the hybrid quantum-classical
solutions to solve the joint optimization problem in federated learning and the in-
tegrated space-air-ground networks. At the same time, I appreciate Prof. Lei Fan
who helped me understand and improve the formulation of the optimization problem
from a mathematical perspective. Moreover, discussions with Prof. Yuanxiong Guo always inspired me to understand the nature of a problem and spot potential issues.
My thanks are also extended to my Ph.D. committee members: Prof. Yan Wang,
and Prof. Hongchang Gao for their time and valuable comments on this dissertation.
Their suggestions helped improve the quality of this dissertation to a large extent. I
would like to thank all my current and former colleagues in the department as well as
visiting scholars. I appreciate their friendship and the happy time we spent together
at Temple University.
Last but not least, I want to thank my girlfriend, Dr. Siyun Chen, for her emo-
tional and spiritual support. Her love and encouragement are the strongest driving
force for the completion of my doctorate program. I would also like to express my
deepest gratitude to my beloved parents for their understanding of my study abroad.
TABLE OF CONTENTS
Page
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.4.1 Placing Data to Edge Servers . . . . . . . . . . . . . . . . . . 15
2.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4 Joint Optimization Problem . . . . . . . . . . . . . . . . . . . 43
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.2 Federated Learning over Edge . . . . . . . . . . . . . . . . . . 66
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3.3 Quantum Formulation for Master Problem . . . . . . . . . . . 100
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
LIST OF FIGURES
2.5 An example of a physical network topology and the relation of shortest path length. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Illustration of the virtual coordinates of two data items in the virtual
plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.13 Comparison of global retrieve and local retrieve in OUR-B and OUR-S. 30
2.14 Distribution of placed data items among servers for OUR-B and four
OUR-S strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 A typical edge cloud environment. . . . . . . . . . . . . . . . . . . . . 37
3.2 Illustration of joint resource placement and task dispatching across two
timescales. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Resource placement and task dispatching via deep reinforcement learning.
4.2 The training process of an FL model within the edge network at dif-
algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.8 Comparison of two different processing orders of FL models. . . . . . 84
4.10 Training accuracy with three FL tasks and the impact of FL workers. 86
5.3 Flow of HQCBD with a single cut and multi cuts. . . . . . . . . . . . 102
5.5 Comparison of the real solver accessing time and gains of MBD over
LIST OF TABLES
5.1 Iteration comparison of CBD and HQCBD over three different cases. 107
5.2 Solver accessing time (ms) comparison of CBD and HQCBD. . . . . . 108
CHAPTER 1
INTRODUCTION
1.1 Background
Recently, there has been tremendous growth in mobile edge computing in both
academia and industry due to its advances over traditional cloud computing (e.g.,
low latency, agility, and privacy). Especially with the increasing amount of data and services offered by diverse applications and IoT/smart devices, network operators and
service providers are likely to build and deploy computing resources (such as data,
models, and services) at the edge of the network near users to shorten the response
time and support real-time intelligence applications.
A typical edge computing environment consists of mobile users, edge clouds (in-
cluding multiple edge servers connected by the edge network), and a remote cloud
(usually within data centers). Each edge server is generally deployed at the network
edge near mobile users and owns specific storage, CPU, and memory capacity. Mobile
users can generate computation tasks at any location, and these tasks need to be dispatched to edge servers with sufficient resources (i.e., internal computation resources such as CPU, memory, and storage) and may also require certain data or services
(i.e., external resources such as training data or machine learning services). Note
that the types of computing tasks from mobile users/devices are heterogeneous due
to diverse settings and applications. For example, some tasks may only request data
(e.g. image, video) or machine learning (ML) model from the edge network, and
then process it locally or perform ML computation based on the model at the local
edge server. Some tasks may request computation at other edge servers with certain
computation services, such as video analysis, speech recognition, and 3D rendering.
Some tasks may need a combination of data, services, and computation resources,
such as distributed federated learning or interactive augmented reality.
Despite the advantages brought by edge computing, the heterogeneity of edge elements, including edge servers, mobile users, data resources, and computing tasks, makes it challenging to effectively manage resources (e.g., data, services) and schedule tasks (e.g., ML/FL tasks) in the edge clouds to meet the QoS
of mobile users or maximize the platform’s utility. The goal of this dissertation is to
build practical solutions to solve the joint resource management and task scheduling
for mobile edge computing. We aim to address the following questions:
1. How and where to place resources in the edge network to minimize the total
accessing cost of all mobile users?
2. Where to dispatch the computing tasks to maximize the total utility of per-
formed tasks?
3. In the specific case of federated learning tasks, how to optimally select participants and determine the local learning rate as well as the learning topology for
multi-model federated learning scenarios to minimize the total learning cost of
all FL models?
4. How to handle the edge network dynamics across different timescales or the
large-scale edge network scenario?
two timescales to deal with different dynamics of tasks and resources in mobile
edge computing.
• Next, we formulate a new joint participant selection and learning schedule prob-
lem of multi-model federated edge learning as a mixed-integer programming
problem, with the goal of minimizing the total FEL cost while satisfying various
constraints. We decouple the original optimization problem into two or three
sub-problems and then propose three algorithms to effectively find participants
and learning rates for each FEL model, by iteratively solving the sub-problems.
CHAPTER 2
2.1 Introduction
Edge computing has grown in popularity as a computing paradigm for enabling
real-time data processing and mobile intelligence in recent years. Edge computing
refers to computing at the network’s edge, where data is generated and distributed
at nearby edge servers to reduce data access latency and improve data processing
efficiency. One of the key challenges in data-intensive edge computing is determining
how to effectively place data at edge clouds so that access latency to the data is
minimized.
As shown in Fig. 2.1, a typical edge computing environment consists of several
entities: mobile user, edge server, edge network, and remote cloud. Unlike the cloud
environment, edge servers are geographically dispersed at the edge of the network
near the mobile users and own heterogeneous computing and storage capability [95,
31, 110, 10, 65, 72, 128]. Each edge server can provide services for those mobile
users in the specific nearby area by holding some data/models and performing the
computation task based on data/models. Hereafter, we use data to refer to both
data and models as long as they are required for performing the service requested
by mobile users.1 When a mobile user requests data, its request is forwarded to the
nearest edge server. If the edge server has the data, it can respond to the mobile
user immediately with the data (as Data C in Fig. 2.1) or perform the corresponding
computing service for the user. Otherwise, the edge server has to retrieve the data
from other edge servers (Data A or B) or even from the remote cloud (Data F). Data
placement is a critical issue in edge computing since the location of data affects the
response latency of the requested service. If the data is stored at a nearby edge server,
the service can be performed very quickly, while a request needed to access a remote
cloud takes much longer to be performed. In addition, as shown in Fig. 2.1, multiple
1 Here, we do not differentiate personal data from public data, as long as the data/model will be used/shared by multiple users at different locations. Also, different security and privacy protection techniques [135, 57, 50] can be applied before the data placement.
[Figure 2.1: A typical edge computing environment. Mobile users at different locations send requests (for Data A, B, C, and F) to nearby base stations and edge servers; Data F resides in the remote cloud.]
mobile users at different locations may request the same data (Data B) and different
data has diverse popularity (i.e., different number of requests from users). Therefore,
in this chapter, we study the data placement problem in edge computing with the
consideration of data popularity.
Data placement has been well studied in distributed systems [17, 1, 13, 59, 103,
100, 88, 19, 4, 49, 125, 26, 109, 141, 107, 133]. However, edge computing has its own
characteristics [95], such as proximity, fluctuation, and heterogeneity. Edge servers deployed in the edge network are in closer proximity to mobile users than servers in traditional distributed systems (e.g., cloud computing), which improves the speed of data processing as a direct result of lower latency. In addition, devices are usually user-controlled and can leave the edge network at any time. That means the network
status is fluctuating over time. Furthermore, the topology of edge environments is
heterogeneous and dynamic, which will bring another challenge to the data place-
ment, e.g. how to maintain the existing data already stored in the edge server when
the topology is changed. Thus, the data placement problem in edge computing has also
drawn significant attention from researchers recently [93, 55, 45, 12]. But most of
them formulate the data placement problem as an optimization problem and leverage
complex optimization solvers to tackle it. Such methods suffer from high computation and communication overheads, which makes them unsuitable for large-scale
systems. Most recently, Xie et al. [123, 122] proposed a novel virtual space-based
method, which maps both switches and data indexes/items into a virtual space and
places data based on virtual distance in the space. Their method can enable efficient
retrieval via greedy forwarding. However, none of them consider data popularity
when placing data on edge servers.
In this chapter, we investigate a static data placement strategy based on data popularity in edge computing to reduce the average forwarding path length of data. Inspired by [123, 122], we also adopt a virtual-space-based placement method with greedy routing-based retrieval, but take data popularity into consideration when we generate the coordinates of data items. Based on the observation that, in a dense network, nodes in the central region have shorter shortest paths to other areas than nodes in the surrounding regions, we carefully design our mapping
strategy so that a popular data item is placed closer to the network center in the
virtual plane. Then the placement of data is purely based on the distance between
the data item and the edge server in the virtual plane. To address the storage limits at
servers and balance the load among edge servers, we further propose several placement
strategies which either offload data items to other servers when the assigned server
is overloaded or place multiple replicas of the same data item to reduce the assigned
load of servers. In both cases, we do take data popularity into consideration when
designing the offloading and replication strategies. Simulation results show that our
proposed strategies can achieve better performance compared to existing solutions
[123, 122]. Moreover, both the offloading and replication strategies can effectively
handle the storage pressure of overloaded edge servers.
6
it has a specific maximal storage capacity ci = c(vi). Let lij = l(vi, vj) represent the shortest path length from edge server vi to vj in G; we then have a distance matrix L = {lij} which holds the lengths of all shortest paths in the edge network.
Assume that we have W data items, D = {d1 , d2 , · · · , dW }, in the system. Each
data di has a specific data size si = s(di ) and data popularity pi = p(di ) (which
will be explained in the next subsection). For each of data item di , we need to find
an edge server vj to hold it. Then the data placement problem can be represented
as finding a mapping f from D to V , where f (di ) = vj . The goal of the data
placement problem is to find a mapping to minimize the average access cost (or
delay) to stored data items in edge network G and also balance the load among edge
servers. Xie et al. [122] proposed a nice virtual-space-based data placement strategy for edge computing; however, they did not consider data popularity among data items. Compared with complex optimization-based data placement strategies, the virtual-space-based method is much simpler and easier to implement.
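The distance matrix L = {lij} above can be computed once from the topology with any all-pairs shortest-path routine. A minimal sketch using Floyd-Warshall follows; the 5-server topology and edge lengths are made up purely for illustration:

```python
# Build the all-pairs shortest-path matrix L = {l_ij} for the edge network G.
# Floyd-Warshall sketch; the 5-server topology below is purely illustrative.
INF = float("inf")

def shortest_path_matrix(n, edges):
    """edges: list of (i, j, length) tuples for an undirected edge network."""
    L = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for i, j, w in edges:
        L[i][j] = min(L[i][j], w)
        L[j][i] = min(L[j][i], w)
    for k in range(n):            # relax paths through each intermediate server
        for i in range(n):
            for j in range(n):
                if L[i][k] + L[k][j] < L[i][j]:
                    L[i][j] = L[i][k] + L[k][j]
    return L

edges = [(0, 1, 1), (1, 2, 1), (2, 3, 2), (0, 3, 5), (3, 4, 1)]
L = shortest_path_matrix(5, edges)
print(L[0][3])  # 4: the path v0-v1-v2-v3 beats the direct edge of length 5
```

For N servers this runs in O(N^3), which matches the all-pairs shortest-path cost cited later for the coordinate construction.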
[Figure: data items with popularity values (e.g., p7 = 20, p37 = 18, p27 = 15, p17 = 11) mapped from the data plane into the virtual space.]
Data popularity can be assessed differently for the distributed system depending
on the application. In general, there are three factors contributing to the data pop-
ularity: the number of accesses (i.e., how many times the data item is requested),
the lifetime, and the request distribution over time or space. In this dissertation, we
simply use the number of accesses as the data popularity. However, it is not difficult to extend our definition to include the other two factors (or even other data popularity measurements) in our system. For each data item di, we assume that its data popularity pi = p(di) describes its number of access requests over time. We assume that the data popularity of each data item is known to the system. Larger data popularity means the data item is more frequently accessed by mobile users in the system. Obviously, the locations of the popular data items matter more to the overall data placement problem than those of unpopular data. Note that there could also be more
complex data popularity models, where various user or location-specific preferences
may be considered differently even for the same data item. Our proposed method
may be further extended to deal with such models by treating the data preferences
from different users/locations with different weights or more refined models. We leave such a study for future work.
[Figure: control-plane overview; the control plane determines the coordinates of data items and the data placement decision.]
forward it based on its virtual coordinate and the installed forwarding rules. To handle load balancing among edge servers and further reduce the data access delay, we propose several additional placement strategies: offloading strategies, which move data items to nearby servers when the storage of the desired edge server exceeds its maximal limit, and replica placement strategies, which strategically place multiple replicas to serve users, using the data popularity to decide the number of replicas in favor of more popular data.
In summary, data popularity plays a central role in our design. It is considered during the construction of the virtual coordinates of data items, the selection of offloading choices, and the decision on the number of replicas to deploy.
[Figure 2.5(a) shows a randomly deployed network topology; Figure 2.5(b) is a bar chart of the average length of shortest paths versus the distance to the network center (bins 0-0.2 through 0.8-1.0).]
Figure 2.5: An example of a physical network topology and the relation of shortest path length.
Second, the mapping method needs to take data popularity into consideration and place the popular data items at locations that have shorter shortest paths to other regions. Last, the mapping method should be deterministic, i.e., given the same data
item, the output of our mapping method should be the same. This can guarantee
that for the data request on the same data item, our retrieval process can lead to the
same location in the virtual plane.
In our solution, we map each data di to a virtual location in the virtual plane whose
polar coordinates are r(di ) and θ(di ). To consider the data popularity in the mapping,
our design is based on the following observation. In a dense network, the center area has shorter shortest paths to all regions. Fig. 2.5(b) shows the average length of shortest paths to other servers versus the distance to the network center for each server in a randomly deployed network with 50 servers (Fig. 2.5(a)). The servers closer to the center of the network have a smaller total length of all shortest paths. Based on this
observation, our mapping method puts a popular data item near the center of the
virtual plane. Specifically, we generate r(di) ∈ [0, 1] using

r(di) = 1 − p(di)/pmax, (2.1)

where pmax is the maximal data popularity among all data items. By doing so, the more popular a data item is, the closer it lies to the center point, as shown in Fig. 2.6(a). To spread
data items at different regions, we calculate the angle θ(di ) using the hash value of
the data’s ID. Particularly, we first calculate the hash value H(di ) by using a hash
function H (e.g. SHA-256). Next, we reduce the hash value to the scope of the virtual
[Figure 2.6 shows two data items d1 and d2 drawn (a) in polar coordinates (r(di), θ(di)) and (b) in Cartesian coordinates (x(di), y(di)).]
Figure 2.6: Illustration of the virtual coordinates of two data items in the virtual
plane.
space by (1) using only the last 4 bytes of H(di) and converting them to a 4-byte binary value h(di), and (2) normalizing h(di) to between 0 and 2π. In other words,

θ(di) = 2π · h(di) / 2^32. (2.2)
By doing so, we place this data item along a certain direction in polar coordinates. Different data items will be spread along different directions. Even data items with the same data popularity will be placed at different locations. The final polar coordinates are (r(di), θ(di)), whose corresponding Cartesian coordinates can be obtained by

x(di) = r(di) × cos θ(di), y(di) = r(di) × sin θ(di). (2.3)
Fig. 2.6 illustrates the relationship between the polar and Cartesian forms of the virtual
coordinates. All data items are mapped into a circular region with a unit radius in
the virtual plane. The construction of coordinates for all data items can be done in
O(W ), where W is the number of data items.
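The whole construction can be condensed into a few lines. The sketch below is illustrative only: it assumes the simple monotone mapping r(di) = 1 − p(di)/pmax, consistent with the stated design (more popular items land closer to the center), and it normalizes the last 4 bytes of SHA-256 to [0, 2π) for θ(di):

```python
import hashlib
import math

def data_coordinates(data_id: str, popularity: float, p_max: float):
    """Map a data item to virtual polar and Cartesian coordinates.
    r: ASSUMED form r = 1 - p/p_max (popular items nearer the center);
    theta: last 4 bytes of SHA-256(data_id), normalized to [0, 2*pi)."""
    r = 1.0 - popularity / p_max
    digest = hashlib.sha256(data_id.encode()).digest()
    h = int.from_bytes(digest[-4:], "big")     # 4-byte value h(d_i)
    theta = 2.0 * math.pi * h / 2 ** 32        # normalize into [0, 2*pi)
    return r, theta, r * math.cos(theta), r * math.sin(theta)

r, theta, x, y = data_coordinates("d7", popularity=20, p_max=20)
print(r)  # 0.0 -- the most popular item maps to the exact center
```

The mapping is deterministic: hashing the data ID always yields the same θ(di), so every request for the same item converges to the same virtual location, as required.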
distance. By doing so, when we place popular data items near the center of the virtual plane, their accessing cost will be relatively smaller, since the cost is proportional to the distance in the virtual plane. In addition, this ensures that the local retrieve (which picks the next server based on virtual coordinates) has a low
routing stretch. This mapping problem is a network embedding (or graph embedding)
problem, which has been well studied. Given the network topology G and the shortest
path measurements among edge servers, we adopt the M-position algorithm used by
[123, 122] to generate the virtual coordinates of edge servers in the 2D virtual plane.
For completeness, we briefly review the basic idea of such an algorithm. Given the
network topology G, we can obtain the shortest path matrix L = {lij }, where lij is
the shortest path length from edge server i to j. Using L as the input, the M-position
algorithm aims to calculate the coordinates of edge servers, which can be represented
as a coordinate matrix Q (a 2 × N matrix of N edge servers in the two-dimensional
virtual plane), i.e.,
Q = [ x(v1) x(v2) · · · x(vN)
      y(v1) y(v2) · · · y(vN) ].
The key idea behind the M-position algorithm is based on the fact that Q can be derived from a scalar product matrix B = −(1/2) J L(2) J via eigenvalue decomposition
[123]. The major steps of the mapping algorithm to generate coordinates of edge
servers are given as follows.
1. Given the network topology G, generate the shortest path matrix L = {lij} and compute L(2) = {lij^2}, which is the squared distance matrix.
The construction of coordinates for all edge servers takes O(N^3), which is dominated by the complexity of the all-pairs shortest path computation and the eigendecomposition of the matrix.
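The eigendecomposition step is essentially classical multidimensional scaling (MDS). The sketch below is a generic classical-MDS implementation rather than the authors' exact M-position code; J here is the standard double-centering matrix I − (1/N)·11ᵀ:

```python
import numpy as np

def virtual_coordinates(L):
    """Classical-MDS sketch of the M-position idea: embed N servers in 2D so
    that virtual distances approximate shortest-path lengths in L."""
    N = L.shape[0]
    L2 = L ** 2                                  # squared-distance matrix L^(2)
    J = np.eye(N) - np.ones((N, N)) / N          # double-centering matrix
    B = -0.5 * J @ L2 @ J                        # scalar-product matrix
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:2]             # keep the two largest
    Q = (vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))).T
    return Q                                     # 2 x N coordinate matrix

# Illustrative input: four servers on a line with unit-length hops.
L = np.array([[0, 1, 2, 3],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [3, 2, 1, 0]], dtype=float)
Q = virtual_coordinates(L)
print(Q.shape)  # (2, 4)
```

For this line topology the shortest-path metric is exactly Euclidean in one dimension, so the embedded pairwise distances reproduce the entries of L.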
Fig. 2.7 shows examples of data placement output. The assignment of servers forms a
Voronoi diagram (Fig. 2.7(a)) in the region, where if a data item falls within a Voronoi
cell then it will be placed at the edge server that owns the Voronoi cell. Since the
more popular data items lie closer to the center of the network (as shown in Fig. 2.7(b)), they will be placed on edge servers whose shortest paths to other servers are shorter. Note that the center of the network here refers to the center of the circular region in the virtual plane. The popular data items are placed on the
servers near the center of the network, which is determined based on the distances
on this virtual plane. The data placement decision is made within O(W N ) where N
and W are the numbers of edge servers and data items, respectively.
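The placement decision itself is then a nearest-neighbor scan over the N servers for each of the W data items, which is exactly the O(WN) cost noted above. A minimal sketch with made-up coordinates:

```python
import math

def basic_placement(data_coords, server_coords):
    """Assign each data item to the server nearest to it in the virtual plane,
    i.e., the server owning the Voronoi cell the item falls into (O(WN) scan)."""
    placement = {}
    for d, (xd, yd) in data_coords.items():
        placement[d] = min(
            server_coords,
            key=lambda v: math.hypot(xd - server_coords[v][0],
                                     yd - server_coords[v][1]),
        )
    return placement

# Illustrative coordinates: v1 sits at the center of the virtual plane.
servers = {"v1": (0.0, 0.0), "v2": (0.9, 0.0)}
data = {"d_popular": (0.05, 0.0), "d_rare": (0.8, 0.1)}
print(basic_placement(data, servers))  # the popular item lands on the central server
```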
[Figure 2.7: (a) the Voronoi diagram formed by edge server coordinates in the virtual plane; (b) example data items (e.g., d7, d16, d27, d45) placed at the servers (e.g., v2, v4, v8, v12) owning their Voronoi cells.]
[Figure 2.8: retrieving data item d1 from source server vs: global retrieve routes directly to the target server vt = f(d1), while local retrieve greedily forwards through intermediate servers (e.g., va, vb).]
directly (which requires knowledge of the coordinates of all servers), in local retrieve the data request is routed towards the coordinate (x(di), y(di)) of the data item in the virtual plane. When a server receives the request, it first checks whether the data is placed on itself. If the current edge server is the target server, it replies to the request. Otherwise, the current server greedily selects the next server to forward to from among its neighboring servers, based on their coordinates. The criterion is to pick the server whose coordinate is nearest to the target coordinate (x(di), y(di)) in the virtual
greedy forwarding may fail at the local minimum, but randomized forwarding can be
used to get out of the local minimum.2 However, compared with global retrieve, local retrieve incurs longer retrieval latency, due to (1) a longer exploration process to find the target server and (2) a longer discovered delivery path between the source and target servers.
Fig. 2.8 illustrates the difference between these two retrieve methods. There is
a trade-off between the performance (retrieve latency) and the complexity (comput-
ing, storing, and updating the global shortest paths). Our proposed data placement
method supports both retrieval methods and can select the appropriate one based on
different application scenarios.
2 There are other methods to eliminate the local minimum for greedy routing, via adjusting transmission range [112] or building a Delaunay graph [111, 42].
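Local retrieve can be sketched as plain greedy geographic forwarding in the virtual plane. The sketch below omits the randomized escape from local minima mentioned above, and all names and coordinates are illustrative:

```python
import math

def local_retrieve(start, target_xy, coords, neighbors, holds_item):
    """Greedy forwarding sketch: at each hop move to the neighbor whose virtual
    coordinate is closest to the item's coordinate; stop at the holding server.
    (A real system also needs an escape from local minima, e.g. a random hop.)"""
    def dist(v):
        return math.hypot(coords[v][0] - target_xy[0],
                          coords[v][1] - target_xy[1])
    path, current = [start], start
    while not holds_item(current):
        nxt = min(neighbors[current], key=dist)
        if dist(nxt) >= dist(current):   # stuck at a local minimum: give up here
            return None
        current = nxt
        path.append(current)
    return path

coords = {"v1": (0.0, 0.0), "v2": (0.5, 0.0), "v3": (1.0, 0.0)}
neighbors = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
holds_d = lambda v: v == "v3"            # data item d is placed at v3
print(local_retrieve("v1", (1.0, 0.0), coords, neighbors, holds_d))  # ['v1', 'v2', 'v3']
```

Because each hop only needs the coordinates of its direct neighbors, this avoids computing, storing, and updating the global shortest paths required by global retrieve.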
2.5 Data Placement with Limited Storage
We have introduced the basic data placement strategy based on data popularity
in the previous section. However, we did not consider the storage capacity at the
edge server yet. Based on the basic placement strategy some of the edge servers may
be overloaded with data items. If each edge server vi has specific maximal storage
capacity ci = c(vi ) such that it can only store data items whose total size is up to
ci . Hereafter, we use cc(vi ) to denote the current storage usage of server vi . In this
section, we propose some simple heuristics to handle load balancing3 . due to storage
limits. All the heuristics use the output (i.e., f (di )) of our basic data placement
as their input, but they are different from each other in (1) the ordering of data
placement and (2) the choice of offloading server. Here, this offloading is just an
additional step during the making of data placement decisions. The real placement
of data items on servers happened after the final data placement decision. In addition,
due to the offloading, data retrieval also needs to be able to find the offloading server.
Algorithm 1 Data Placement in Order of Data Popularity
Input: The placement decision place_dec = {(di, f1(di))} determined by Basic_Placement.
Output: The updated new placement decision place_dec.
1: Sort place_dec in descending order of popularity p(di);
2: for each item = (di, vj) in place_dec do
3:    if s(di) + cc(vj) > c(vj) then
4:        vl = Find_Offloading_Server(di, vj);
5:        Update and confirm item = (di, vl), i.e., f2(di) = vl;
6:        cc(vl) += s(di);
7:    else
8:        Confirm item = (di, vj), i.e., f2(di) = vj;
9:        cc(vj) += s(di);
10: return place_dec = {(di, f2(di))}
summation of the data size s(di) of di and the current storage usage cc(vj) of vj does not exceed the maximal server storage c(vj), then we confirm this placement and place data di on edge server vj. Otherwise, we find an available edge server vl to offload this data item (denoted by a procedure Find_Offloading_Server) and modify the initial placement decision to f2(di) = vl. Multiple ways to find such an edge
server to offload will be discussed in the next subsection. The detailed algorithm is
presented in Algorithm 1. By performing this algorithm, we can guarantee that each
server has sufficient storage to hold all assigned data items and avoid the overloading
of certain edge servers. The total time complexity of Algorithm 1 is O(W log W +
W X), where O(W log W ) is from ordering the data popularity and O(W X) is for
W rounds of finding an offloading server for each data item. Here, X is the cost
of Find_Offloading_Server, which is bounded by the number of neighbors of the server vi or by N, depending on which method is used.
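Algorithm 1 translates almost directly into code. In the sketch below, Find_Offloading_Server is replaced by an illustrative stand-in rule (the nearest server, in shortest-path distance from the originally assigned one, that still has room); the actual choice of offloading server is discussed in the next subsection:

```python
def place_by_popularity(place_dec, p, s, c, dist):
    """Algorithm 1 sketch: confirm placements in descending popularity order,
    offloading a data item when its assigned server is full.
    place_dec: {d: v} from the basic placement; p/s: popularity/size per item;
    c: server capacities; dist: dist[u][v] shortest-path lengths.
    The offloading rule (nearest server with room) is an illustrative stand-in."""
    cc = {v: 0 for v in c}                       # current storage usage cc(v)
    final = {}
    for d, v in sorted(place_dec.items(), key=lambda kv: -p[kv[0]]):
        if s[d] + cc[v] > c[v]:                  # v overloaded: offload d
            v = min((u for u in c if s[d] + cc[u] <= c[u]),
                    key=lambda u: dist[place_dec[d]][u])
        final[d] = v
        cc[v] += s[d]
    return final

c = {"v1": 2, "v2": 5}
dist = {"v1": {"v1": 0, "v2": 1}, "v2": {"v1": 1, "v2": 0}}
place_dec = {"d1": "v1", "d2": "v1"}             # both initially mapped to v1
p, s = {"d1": 9, "d2": 3}, {"d1": 2, "d2": 2}
print(place_by_popularity(place_dec, p, s, c, dist))
# the popular d1 stays on v1; d2 no longer fits there and is offloaded to v2
```

Processing in descending popularity order means the most popular items keep their (lowest-cost) originally assigned servers, and only less popular items are pushed away when storage runs out.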
Placing Data in the Order of Server Capacity The second method processes
the data placement in a different order which is based on the maximal edge server
storage capacity. The idea of this strategy is to deal with the edge server that has
a bigger maximal storage capacity first. When determining which data should be
placed on the current edge server and which should be offloaded, we continue to
Algorithm 2 Data Placement in Order of Server Capacity
Input: The placement decision place_dec = {(di, f1(di))} determined by Basic_Placement.
Output: The updated new placement decision place_dec.
1: Sort V in descending order of server capacity c(vi);
2: for each vi in V do
3:    Generate Di based on place_dec;
4:    Sort Di in descending order of data popularity p(dk);
5:    for each dk in Di do
6:        if s(dk) + cc(vi) > c(vi) then
7:            vl = Find_Offloading_Server(dk, vi);
8:            Update/confirm dk's placement, i.e., f3(dk) = vl;
9:            cc(vl) += s(dk);
10:       else
11:           Confirm dk's placement, i.e., f3(dk) = vi;
12:           cc(vi) += s(dk);
13: return place_dec = {(dk, f3(dk))}
take into account data popularity, where more popular data is more likely to stay at the current server. The algorithm acts as follows. First, we sort the list of edge servers V in descending order according to the maximal edge server storage capacity c(vi), such that c(v1) ≥ c(v2) ≥ · · · ≥ c(vN). For each server vi, we define Di as a list
that consists of all data items assigned to vi by the basic data placement, i.e., Di =
{dk |vi = f1 (dk )}. Then we process the edge server to confirm or update the data
placement on that server. For each server vi , Di is sorted based on data popularity
in descending order. We process the data item dk ∈ Di . If placing this item does
not exceed the maximal storage of vi , i.e., s(dk ) + cc(vi ) ≤ c(vi ), its placement is
confirmed. Otherwise, we simply call the procedure Find_Offloading_Server to find a nearby server to place it on and update its placement f3(dk). The whole process
is repeated for all data items on all servers. The detailed algorithm is presented
in Algorithm 2. The major difference with the first method is that the processing
order is based on server capacity (the outer “for” loop in Algorithm 2). The total
time complexity of Algorithm 2 is O(N log N + N
P
i=1 (|Di | log |Di | + |Di |X)). Here,
20
cc(vf )=0 cc(ve)=2 cc(vf )=0 cc(ve)=2 cc(vf )=2 cc(ve)=2
v v v
c(vf )=5 f ve c(ve)=4 c(vf )=5 f ve c(ve)=4 c(vf )=5 f ve c(ve)=4
O(N log N) is from sorting the server capacities, O(|Di| log |Di|) is from sorting the data popularity within Di, and O(|Di|X) accounts for |Di| rounds of Find_Offloading_Server, one for each data item in Di. Note that sum_{i=1}^{N} |Di| = W.
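The capacity-ordered processing of Algorithm 2 can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the data structures and the `find_offloading_server` callback (which stands in for the Find_Offloading_Server procedure) are hypothetical.

```python
def capacity_ordered_placement(servers, items, assigned, find_offloading_server):
    """Sketch of Algorithm 2: confirm or offload data items, processing
    servers in descending order of capacity and, within each server, items
    in descending order of popularity.

    servers: dict vi -> {'cap': c(vi), 'used': cc(vi)}
    items:   dict dk -> {'size': s(dk), 'pop': p(dk)}
    assigned: dict vi -> list of items placed on vi by the basic placement f1
    find_offloading_server(dk, vi): returns a nearby server with spare room
    """
    placement = {}
    # Outer loop: servers sorted by maximal storage capacity c(vi), descending.
    for vi in sorted(assigned, key=lambda v: servers[v]['cap'], reverse=True):
        # Inner loop: data items sorted by popularity p(dk), descending.
        for dk in sorted(assigned[vi], key=lambda d: items[d]['pop'], reverse=True):
            if items[dk]['size'] + servers[vi]['used'] > servers[vi]['cap']:
                vl = find_offloading_server(dk, vi)   # offload to a nearby server
            else:
                vl = vi                               # placement confirmed
            placement[dk] = vl                        # f3(dk) = vl
            servers[vl]['used'] += items[dk]['size']  # cc(vl) += s(dk)
    return placement
```

Because `vl = vi` in the confirm branch, the single `used` update covers both lines 9 and 12 of the pseudocode.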
2.5.3 Data Retrieve
Since the proposed methods might offload a data item di from its originally assigned server vj = f1(di) to another server, say vl = f2(di) or f3(di), we need a way for the data retrieval method to find the new server. Note that even though the network controller may know the location of the new server, an individual server receiving a data request may not know the global server capacity information, and thus cannot reproduce the offloading decision. Instead of broadcasting all offloading decisions to the whole network, a simple solution is to let the original server vj host a forwarding entry that records the path towards the new server vl. By doing so, the data retrieval methods can stay the same: the data request for di is still forwarded towards vj, and when it reaches vj, it is further forwarded to vl. This costs additional path length and retrieval latency. However, since the offloading server is selected to minimize the distance between vj and vl, the additional cost is minimized.
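The forwarding-entry idea can be illustrated with a small sketch. The class and names here are hypothetical, and the per-item forwarding table holds a direct server reference rather than a recorded path, which is a simplification of the scheme described above.

```python
class EdgeServer:
    """Minimal sketch of the forwarding-entry mechanism for offloaded data.

    When an item is offloaded from its originally assigned server to another,
    the original server keeps a forwarding entry, so retrieval is unchanged:
    requests are still routed to the original server, which forwards them on.
    """
    def __init__(self, name):
        self.name = name
        self.store = {}    # data items held locally
        self.forward = {}  # item -> server now holding the offloaded item

    def retrieve(self, item):
        if item in self.store:
            return self.store[item], [self.name]       # found locally
        if item in self.forward:                       # one extra forwarding hop
            data, path = self.forward[item].retrieve(item)
            return data, [self.name] + path
        raise KeyError(item)
```

For example, after vj offloads di to vl, a request arriving at vj traverses the extra hop vj -> vl transparently to the requester.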
choose the edge server to place these replicas? Next, we answer these two questions
separately for our data placement design.
Recall that N is the number of edge servers, and smax and pmax are the maximal data size and data popularity. cc(vj) and c(vj) are the currently used storage and the maximal storage limit of vj, so c(vj) − cc(vj) is the remaining storage capacity of vj. α1, α2, and α3 are weights on these three coefficients, since the relative importance of the three aspects can vary with the data characteristics and system conditions. While α1 + α2 + α3 = 1, they can be adjusted to meet different requirements. For instance, with increasing data popularity, we can use a higher α2 to increase the number of replicas for the more popular data. Similarly, if our edge network system has limited storage capacity or low bandwidth, we may decrease α3 or α1 accordingly. β is a ratio parameter that controls the total number of replicas with respect to the number of servers N; a larger β leads to more replicas in total.
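To make the role of the weights concrete, the following sketch assumes a hypothetical weighted-sum form for the replica count. The exact formula is defined elsewhere in this chapter and is not reproduced here; every coefficient below (the size, popularity, and remaining-storage ratios, and the β·N scaling) is an illustrative assumption, chosen only to show how α1, α2, α3, and β steer the number of replicas.

```python
import math

def num_replicas(size, pop, remaining, s_max, p_max, c_max,
                 a1=1/3, a2=1/3, a3=1/3, beta=0.2, n_servers=50):
    """Hypothetical replica-count rule: a convex combination (a1+a2+a3=1)
    of a size ratio, a popularity ratio, and a remaining-storage ratio,
    scaled by beta * N. Illustrative only; not the dissertation's formula.
    """
    score = (a1 * (1 - size / s_max)      # smaller data -> more replicas
             + a2 * (pop / p_max)         # popular data -> more replicas
             + a3 * (remaining / c_max))  # spare storage -> more replicas
    return max(1, math.ceil(beta * n_servers * score))
```

With these assumptions, raising α2 amplifies the popularity term, so popular items receive more replicas, matching the tuning behavior described above.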
Figure 2.10: Illustration of the virtual coordinates of data replicas generated by our
method.
achieve all of these, we modify our mapping method which generates the coordinates
of data items to map a data item to n(di ) locations in the virtual plane based on its
popularity. Both the placement and retrieve methods can still be the same, where
each replica is just placed on the nearest server in the virtual plane, and data requests
are forwarded to that server during retrieving.
Calculating Coordinates of Replicas. For each data item di, we generate n(di) replicas, denoted by d_i^1, d_i^2, · · · , d_i^{n(di)}, and spread them on the virtual plane. Inspired by the Voronoi diagram, we spread all replicas using a radius that depends on the data popularity but with different angles. As shown in Fig. 2.10(a), the Voronoi diagram formed by all replicas is evenly distributed in the virtual plane. The polar coordinates of the k-th replica d_i^k in the virtual plane are given by the following equations:

r(d_i^k) = 1 − p(di)/pmax,
θ(d_i^k) = 2π·h(di)/(2^32 − 1) + 2π·k/n(di).    (2.5)
Note that the first copy d_i^1 of data item di is mapped to the same location as di. The radius and angle are still deterministically decided by the data popularity and the data index. The other replicas are evenly distributed with an angle difference of 2π/n(di) at the same radius. This solution seems to achieve all desired goals, but it may have a problem when the data item is very popular. In that case, the radius is small, so all replicas will be placed around the center of the network. Though their Voronoi cells are equal, this is not ideal since multiple replicas will be near one another. Therefore, we further modify the mapping method by defining a threshold τ < 0.5. If r(d_i^k) < τ, we shift r(d_i^k) by adding 0.5, except for the first copy of the data, which stays at its original location. Fig. 2.10(b) shows such an example and the Voronoi cells of all replicas. Then, the
new mapping method is given as follows:

r(d_i^k) = 1.5 − p(di)/pmax,  if 1 − p(di)/pmax < τ and k > 1,
r(d_i^k) = 1 − p(di)/pmax,    otherwise;                        (2.6)
θ(d_i^k) = 2π·h(di)/(2^32 − 1) + 2π·k/n(di).
In our simulations, we use τ = 0.01. After we have the coordinates of all replicas,
we can place these replicas on the closest edge server in virtual space. The retrieval
procedure is straightforward too.
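The coordinate computation of Eqs. (2.5) and (2.6) can be sketched directly. The function name is hypothetical, and the assumption that h(di) is a 32-bit hash (hence the 2^32 − 1 normalization) follows the denominator in the equations above.

```python
import math

HASH_MAX = 2**32 - 1   # range of the hash h(d_i), assumed 32-bit

def replica_coords(pop, p_max, h, n_rep, tau=0.01):
    """Polar coordinates (r, theta) of the n_rep replicas of one data item,
    following Eqs. (2.5)/(2.6): the first replica stays at the item's original
    radius; for very popular items (base radius below tau), the remaining
    replicas are shifted outward by +0.5.
    """
    base_r = 1 - pop / p_max
    coords = []
    for k in range(1, n_rep + 1):
        r = base_r
        if base_r < tau and k > 1:       # popular item: push extra replicas out
            r = base_r + 0.5             # i.e., 1.5 - pop/p_max
        theta = 2 * math.pi * h / HASH_MAX + 2 * math.pi * k / n_rep
        coords.append((r, theta % (2 * math.pi)))
    return coords
```

Each replica is then simply placed on the edge server closest to its (r, θ) position in the virtual plane.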
Our new placement method with replicas makes sure that (1) for more popular data, at least one copy is closer to the center; (2) different replicas of the same data are well spread in the virtual plane; and (3) the shortest retrieval paths are reduced, since copies of the data can be found at multiple locations.
2.7 Evaluation
In this section, we report the results from our simulations to evaluate our proposed
data placement strategies.
server(s) and then perform data retrievals of all data items randomly from all edge servers based on their data popularity. Three main performance metrics are used to evaluate the proposed methods:
We test seven different versions of our data placement strategies in the simulations:
• OUR-S. This is a set of four data placement strategies where data items exceeding the maximal storage limit at the server assigned by OUR-B are offloaded to nearby servers. Since we have two methods for processing offloading in different orders (Algorithm 1 and Algorithm 2) and two methods for choosing the offloading target (Nearest Neighbor and Nearest Server), we have four different OUR-S methods in total:
– OUR-S3: Algorithm 2 + Nearest Neighbor,
• OUR-R. This is the data placement strategy that places multiple replicas in the network, where the number of replicas n(di) of a data item di depends on its popularity, size, and available storage. For comparison, we also implement and test another version, OUR-R-fixed, where a fixed number of replicas is used.
For all of these data placement strategies, both global retrieve and local retrieve can be used. Finally, each simulation runs 100 times and we report the average results.
[Figure: bar charts comparing the average accessing cost of the COIN, OUR-B, OPT, GRED, and OUR-S1 placement strategies.]
other parts of the network, the path lengths of all methods are longer than those in the left plot. In addition, we can also observe that our proposed algorithms (OUR-B and OUR-S1) are much better than the existing methods (COIN and GRED) in this case. This is mainly because our proposed methods consider data popularity and ensure that the more popular data is closer to the network center, from which the average path length to the boundary region is much shorter.
Obviously, without considering the data storage limit, the average accessing cost of OPT is better than that of all other methods. However, such an optimal method will increase the storage burden of the selected server, because all data items are placed on the single optimal server. On the contrary, our proposed methods spread all data across different servers based on their data popularity and virtual distances. Thus, our methods can balance the storage burden among edge servers while keeping relatively small accessing costs, as shown in Fig. 2.11. In addition, if the storage limit and/or request distribution are considered, finding the optimal server becomes a very challenging optimization problem.
[Figure 2.12: average path length and average retrieval latency (ms) of the global and local retrieve strategies.]
Note that the average path length under different numbers of requests is almost the same for both retrieval methods. This is reasonable since the data placement is static. More importantly, the global retrieve strategy has a shorter average path length than the local retrieve strategy. This is because the global strategy always takes the shortest path between the source and target servers. Fig. 2.12(b) presents the average retrieve latency of the two retrieve methods. As we can see, the average retrieve latency of the local strategy is far larger than that of the global strategy. This is because (1) the local retrieve strategy takes more hops during forwarding, and (2) it may also need to perform random forwarding to escape from local minima.
Second, we fix the number of requests at 1,000 and test the two retrieve strategies with both OUR-B and the four OUR-S placement methods. Fig. 2.13 presents the results, from which we can draw the following conclusions. (1) Similar to the results in Fig. 2.12, local retrieve takes a longer path and larger latency than global retrieve in all cases. (2) For the average path length under global retrieve, it is clear that those of OUR-S are longer than that of OUR-B. Recall that in OUR-S, data items may be offloaded to another edge server rather than the server nearest to the data; thus, global retrieve needs an additional path to reach the target server. (3) For local retrieve, the average path lengths of OUR-S are shorter than that of OUR-B. This might be because some of the data items are offloaded to servers that are closer to the requesting edge server. (4) The average retrieve latency of OUR-S is longer than that of OUR-B for both global and local retrieve, since retrieval takes additional time to find where the data is stored. Interestingly, even though the average path length of local retrieve for OUR-S is shorter than that for OUR-B, local
(a) average path length (b) average retrieve latency
Figure 2.13: Comparison of global retrieve and local retrieve in OUR-B and OUR-S.
retrieve still needs more time to figure out the location of the data item, thus leading to a much longer retrieve latency than OUR-B.
In summary, the local strategy in general leads to a longer average path length and larger retrieve latency than global retrieve. However, the local strategy only utilizes the neighbors' information to compute the forwarding decision, which avoids storing all shortest paths and makes it work well at scale. Therefore, there is a trade-off between performance (retrieve latency) and complexity (computing, storing, and updating the global shortest paths). For all simulation results in the remaining sections, we only report the results from global retrieve due to space limitations.
Figure 2.14: Distribution of placed data items among servers for OUR-B and four
OUR-S strategies.
• OUR-B, the basic placement strategy where only a single copy of the data
item is placed.
• OUR-R, the proposed placement strategy with multiple replicas, where the
number of replicas is calculated based on data popularity, data size, and avail-
able storage.
To treat all methods with multiple replicas fairly, we let the fixed number of replicas in
Random-R-fixed/OUR-R-fixed be equal to the average number of replicas on OUR-R.
First, we display the loads among all servers with multiple data replicas, as shown in Fig. 2.15. As we can see, there is more data on each edge server compared with the single-replica strategy OUR-B. In addition, the difference in loads between the three multiple-replica strategies is not obvious due to data duplication. However, their average path lengths differ, as shown in the next figure.
[Figure 2.15: number of data items placed on each server (server index 0-49) for OUR-B, OUR-R, Random-R-fixed, and OUR-R-fixed.]
[Figure 2.16: average path length of all methods under global and local retrieve.]
Fig. 2.16 shows the results of all methods under either global or local retrieve. First, the average path length of local retrieve is longer than that of global retrieve, the same conclusion as in the previous simulations. Second, all methods with multiple replicas perform much better than the single-replica method. Compared to the single-replica strategy, OUR-R reduces the average path length by up to 36%. This confirms that data replication can significantly reduce the average path length of data requests. Third, among the multiple-replica methods, OUR-R performs better than OUR-R-fixed. This shows the advantage of the carefully designed replica-number estimation based on data popularity over a fixed number of replicas (evenly distributed among data items). Fourth, under global retrieve, OUR-R-fixed performs better than Random-R-fixed, which shows that the proposed placement with the Voronoi diagram is much better than random placement. However, OUR-R-fixed performs worse than Random-R-fixed under local retrieve, since the Random-R-fixed method may randomly place multiple replicas near the requesting server. In summary, among all methods with multiple replicas, our proposed OUR-R achieves the best results.
The problem is modeled as a 0–1 integer programming problem to consider the data
dependency, data reliability, and user cooperation, and then solved by an intelligent
swarm optimization. Similarly, Lin et al. [55] also proposed a self-adaptive discrete
particle swarm optimization algorithm to optimize the data transmission time when
placing data for a scientific workflow. Li et al. [45] investigated a joint optimization
of data placement and task scheduling in edge computing to reduce the computation
delay and response time. For the data placement optimization, the authors considered
the value, transmission cost, and replacement cost of data blocks, and the formulated
optimization problem is solved by a tabu search algorithm designed for the knapsack
problem. However, again these optimization-based methods usually suffer from poor
stability and high overheads. Breitbach et al. [12] have also studied both data
placement and task placement in edge computing by considering multiple context
dimensions. For its data placement part, the proposed data management scheme
adopts a context-aware replication, where the parameters of the replication strategy
are tuned based on context information (such as data size, remaining storage, stability,
and application).
Most recently, Huang et al. [30] have studied caching fairness for data sharing in
edge computing environments. They propose fairness metrics to take resources and
wireless contention into consideration and formulate the caching fairness problem as
an integer linear programming problem. Then they propose an approximation algo-
rithm based on the connected facility location algorithm and a distributed algorithm.
Xie et al. [123] studied the data-sharing problem in edge computing and proposed
a coordinate-based data indexing mechanism to enable efficient data sharing in edge
computing. It maps both switches and data indexes into a virtual space with asso-
ciated coordinates, and then the index servers are selected for each data based on
the virtual coordinates. Their simulations showed that both the routing path lengths
and forwarding table sizes for publishing/querying the data indexes are efficient. Xie
et al. [122] further extended their virtual-space method to handle data placement and retrieval in edge computing, with an enhancement based on centroidal Voronoi tessellation to balance loads among edge servers. Both [123] and [122] inspire our work on data placement with data popularity (adopting a virtual-space-based placement method with greedy routing-based retrieval), but they do not consider the popularity of data items.
Note that there are other types of resource management problems in edge computing, such as virtual network function placement [90, 130], service placement [79, 22], and cloudlet placement [140, 139, 129]. These problems differ from the data placement problem, and their solutions cannot directly solve the data placement problem considered here.
CHAPTER 3
3.1 Introduction
With the proliferation of Internet of Things (IoT) data and innovative mobile
services, there is a growing demand for low-latency access to resources such as data
and computing services. Mobile edge computing has evolved into an effective com-
puting paradigm for meeting the need for low-latency access by locating resources
and dispatching tasks at edge clouds close to mobile users.
As shown in Fig. 3.1, a typical edge computing environment consists of mobile users, edge clouds (including multiple edge servers connected by the edge network), and a remote cloud (usually within data centers). Each edge server is generally deployed at the network edge near mobile users and has specific storage, CPU, and memory capacities. Mobile users can generate computation tasks at any location; these tasks request to be dispatched to edge servers with sufficient resources (i.e., internal computation resources such as CPU, memory, and storage) and may also require certain data or services (i.e., external resources such as training data or machine learning services). Note that the types of computing tasks from mobile users/devices are heterogeneous due to diverse settings and applications. For example, some tasks may only request data (e.g., images, videos) or a machine learning (ML) model from the edge network, and then process it locally or perform ML computation based on the model at the local edge server. Some tasks may request computation at other edge servers with certain computation services, such as video analysis, speech recognition, and 3D rendering. Some tasks may need a combination of data, services, and computation resources, such as distributed federated learning or interactive augmented reality. Fig. 3.1 shows some examples where tasks from mobile users request data, services, or both. Note that multiple user tasks can be served by the same edge server, and deploying multiple copies of resources can usually reduce the accessing cost or balance loads among servers. The diverse types of tasks from mobile users and
[Figure 3.1: a typical edge computing environment with mobile users, edge servers 1-5 (holding data items A-E and services X-Y) connected by the edge network, and a remote cloud; e.g., task 2 requests data E and service Y, and task 3 requests service Y.]
3.2 System Models and The Optimization
In this section, we first introduce our network and system models under a general
edge computing architecture. Then we formulate the resource placement problem,
the task dispatching problem, and the joint optimization problem, respectively.
Then, the requested resource set is Ωk = {qj | ωk,j = 1}, and its input resource size αk can be calculated as αk = sum_{j=1}^{O} ωk,j·oj. Note that the resource requested by task uk
each server vi, we also assume there is a status indicator st_i^t representing whether the server is available at time t (available when st_i^t = 1, not available when st_i^t = 0). There are two possible causes of unavailability: predictable events (such as scheduled updates or maintenance) and sudden events (such as power outages). Here we mainly consider the first type; for the latter, different backup strategies should be considered.
Here, we assume that data items and services can have replicas in the edge cloud (i.e., sum_{i=1}^{N} x_{j,i}^t can be larger than 1). In addition, an edge server may store multiple data items and services, but the total size placed on edge server vi cannot exceed its current remaining storage capacity:

sum_{j=1}^{O} x_{j,i}^t·oj ≤ st_i^t·cc_i, for all vi.    (3.2)
For services, there are also specific CPU and memory requirements on the placed server:

x_{j,i}^t·ζr ≤ st_i^t·f_i, for all vi, qj.    (3.3)
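A per-server feasibility check for these placement constraints can be sketched as follows. The function name and data layout are illustrative; the per-resource CPU requirement (written ζr in the text) is read here as a per-service value, which is an assumption.

```python
def placement_feasible(x, o, zeta, st, cc, f):
    """Check the resource placement constraints for one server at time t:
    total placed size within remaining storage (cf. Eq. 3.2), and each placed
    service's CPU requirement within the server's CPU capacity (cf. Eq. 3.3).

    x[j]   : 1 if resource q_j is placed on this server, else 0
    o[j]   : size of q_j; zeta[j]: CPU requirement of q_j (0 for pure data)
    st     : availability indicator of the server (0 or 1)
    cc, f  : remaining storage capacity and CPU capacity of the server
    """
    storage_ok = sum(xj * oj for xj, oj in zip(x, o)) <= st * cc
    cpu_ok = all(xj * zj <= st * f for xj, zj in zip(x, zeta))
    return storage_ok and cpu_ok
```

Note that st = 0 (unavailable server) forces both right-hand sides to zero, so any nonzero placement on an unavailable server is rejected, exactly as the indicator in the constraints intends.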
The resource placement aims to maximize the total benefit minus the total cost
from all serving tasks while satisfying resource constraints. Here, we consider two
types of costs from serving tasks: placement cost and accessing cost.
For the placement cost of a resource item qj to a server vi during the placement,
we consider two possible ways: (a) directly downloading from the cloud with a cost
of ϖj , or (b) transferring from a nearby server vk , which holds a copy of qj at t − 1,
with a cost of f(qj, vk, vi). Here, assume that Pj is the shortest path in G^t connecting vk and vi^1; then the cost f(qj, vk, vi) can be defined as follows:

f(qj, vk, vi) = 0, if vi = vk;
f(qj, vk, vi) = sum_{el ∈ Pj} (oj/bl + pl), otherwise.    (3.5)
Thus, the placement cost of qj to vi at t is the minimum among all of these, i.e.,

pc_{j,i}^t = 0, if x_{j,i}^{t−1} = 1;
pc_{j,i}^t = min(ϖj, min_{k≠i}(x_{j,k}^{t−1}·f(qj, vk, vi))), otherwise.    (3.6)
For the accessing cost of resources after the data/service is placed, let σ_{j,k}^t be the accessing cost of resource qj required by task uk. Note that the accessing cost depends on which edge server task uk is processed at. Let Υk = Υ(uk) be the server assigned by the task dispatching of uk. The accessing cost of qj can be defined as

σ_{j,k}^t = min_{vi ≠ Υk} x_{j,i}^t·f(qj, Υk, vi).    (3.9)

Without task dispatching, we assume that task uk is processed at its arriving server Ψk; then the accessing cost is

σ_{j,k}^t = min_{vi ≠ Ψk} x_{j,i}^t·f(qj, Ψk, vi).    (3.10)

σ̄_{j,i}^t = min_{l≠i} x_{j,l}^t·f(qj, vl, vi).    (3.11)
^1 Here G^t represents the edge network formed by all available servers at time t. The shortest path is defined with respect to the summation of the propagation and transmission delays of qj over the path.
Since each serving task has a benefit ρk, the utility of each task uk can be defined as sum_j (ρk − ωk,j·σ_{j,k}^t).
Now we can formulate the resource placement problem as an optimization problem. The objective is to maximize the total utility from all serving tasks minus the summation of accessing costs for all resources at time t:

max sum_k sum_j (ρk − ωk,j·σ_{j,k}^t) − sum_j ωj·ν_j^t
s.t. sum_j x_{j,i}^t·oj ≤ st_i^t·cc_i, ∀i

Here we assume that each task is dispatched to at most a single server, i.e., sum_{i=1}^{N} y_{k,i}^t ≤ 1.
Note that there are different types of tasks: some only need data from the edge
network, some only need to perform general computation at any server either with
data or not, and some need to perform specific computation with certain services
at the available server. Our formulation can model all these task types. If task uk
only needs data, γk = 0, δk = 0 while αk > 0. If uk only needs general computation
without specific service or data, γk > 0, δk > 0 while αk = 0.
Assume that task uk is dispatched to edge server vi, i.e., y_{k,i}^t = 1; then its associated costs are defined as follows.
Accessing cost of resources: The transmission cost of the input data and needed service for task uk is defined as C_{k,i}^{input} = sum_{j=1}^{O} ωk,j·σ̄_{j,i}^t.
Computation cost: Let ξk(z) be the function defining the CPU cycles needed to process task uk with input data/service size z. The computation cost of task uk processed on edge server vi is defined as C_{k,i}^{comp} = sum_{j=1}^{O} ωk,j·ξk(oj)/f_i.
Transmission cost of output: The total transmission cost of the output data of task uk from edge server vi to the arriving edge server Ψk is C_{k,i}^{output} = f(βk, vi, Ψk).
Therefore, the completion cost of task uk on server vi is calculated as ς_{k,i}^t = C_{k,i}^{input} + C_{k,i}^{comp} + C_{k,i}^{output}.
Recall that each task has a benefit ρk. We can then formulate the task dispatching decision as an optimization problem whose goal is to maximize the total task utility, where y_{k,i}^t = 1 if task uk runs on server vi at t:

max sum_k sum_i y_{k,i}^t·(ρk − ς_{k,i}^t)
s.t. sum_k y_{k,i}^t·ς_{k,i}^t ≤ τ, ∀i
     y_{k,i}^t·αk ≤ st_i^t·cc_i, ∀i, k
     y_{k,i}^t·γk ≤ st_i^t·f_i, ∀i, k    (3.15)
     y_{k,i}^t·δk ≤ st_i^t·m_i, ∀i, k
     sum_i y_{k,i}^t ≤ 1, ∀k
     y_{k,i}^t ∈ {0, 1}, ∀k, ∀i
     i ∈ (1, 2, . . . , N), k ∈ (1, 2, . . . , Z)

Note that the constraint sum_k y_{k,i}^t·ς_{k,i}^t ≤ τ makes sure that the dispatched tasks can be completed within the duration of a time scale τ.
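For very small instances, the dispatching problem (3.15) can be solved by exhaustive search, which is useful as a reference when validating heuristics. This brute-force sketch is illustrative only (it enumerates (N+1)^Z assignments, so it does not scale); all names are hypothetical.

```python
from itertools import product

def dispatch_bruteforce(rho, cost, alpha, gamma, delta, st, cc, f, m, tau):
    """Exhaustive reference solver for the task dispatching problem: try every
    assignment of Z tasks to N servers (index -1 = task not dispatched) and
    keep the feasible assignment maximizing sum_k (rho_k - cost_{k,i}).
    cost[k][i] plays the role of the completion cost of task k on server i.
    """
    Z, N = len(rho), len(st)
    best, best_assign = 0.0, [-1] * Z
    for assign in product(range(-1, N), repeat=Z):
        util, load = 0.0, [0.0] * N
        feasible = True
        for k, i in enumerate(assign):
            if i < 0:
                continue                      # task left undispatched
            # per-task resource constraints on the chosen server
            if not (alpha[k] <= st[i] * cc[i] and gamma[k] <= st[i] * f[i]
                    and delta[k] <= st[i] * m[i]):
                feasible = False
                break
            load[i] += cost[k][i]
            util += rho[k] - cost[k][i]
        # per-server deadline: dispatched tasks finish within the time scale tau
        if feasible and all(l <= tau for l in load) and util > best:
            best, best_assign = util, list(assign)
    return best, best_assign
```

In the evaluation sections, such a solver is only practical as a ground-truth oracle for tiny (Z, N); the actual problems are handled by the methods described in this chapter.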
3.2.4 Joint Optimization Problem
We now consider the joint resource placement and task dispatching problem as a nonlinear programming problem:

max sum_k sum_i y_{k,i}^t·(ρk − ς_{k,i}^t) − sum_j ωj·ν_j^t    (3.16)
s.t. sum_j x_{j,i}^t·oj + y_{k,i}^t·αk ≤ st_i^t·cc_i, ∀i, k    (3.17)
problem (obtaining x_{j,i}^{t,ι}) to maximize the total task utilities. Next, we take the solution to the second stage.
Stage 2: Solving the task dispatching problem with fixed resource placement. In this stage, we take the resource placement decision x_{j,i}^{t,ι} generated in the first stage as input and determine the task dispatching decision y_{k,i}^{t,ι} for each task to maximize the total utility. The problem can be formulated as P2:

max sum_k sum_i y_{k,i}^t·(ρk − ς_{k,i}^t) − sum_j ϱ_j^t    (3.28)
s.t. (3.17), (3.20) − (3.26)

The solution of this stage is y_{k,i}^{t,ι}.
After the decomposition, in each round both P1 and P2 are integer linear programming problems, and thus can be solved by classical methods (e.g., dynamic programming, branch and bound).
Overall Iteration, Initialization, and Termination: Algorithm 3 shows the overall algorithm. Initially, a feasible random task dispatching decision y_{k,i}^{t,0} is generated (Line 2). Then, in each round (Lines 5-12), we solve P1 and P2 with the previous decision as input. The resource placement and task dispatching decisions (x_{j,i}^{t,ι} and y_{k,i}^{t,ι}) are optimized iteratively. Finally, the iteration terminates (Line 13) when
Algorithm 3 Two-Stage Optimization Method
Input: Status of all servers V and the network G, resources Q and tasks U for time t.
Output: Resource placement and task dispatching decisions x_{j,i}^t and y_{k,i}^t.
1: Initialize max_itr, max_occur, bound_val
2: Generate a random initial task dispatching decision y_{k,i}^{t,0} which is feasible (i.e., satisfying the constraints in P2)
3: ι = 1 and count_num = 0;
4: repeat
5:   Stage 1: Calculate x_{j,i}^{t,ι} by solving P1 with y_{k,i}^{t,ι−1} as the fixed task dispatching
6:   Stage 2: Calculate y_{k,i}^{t,ι} by solving P2 with x_{j,i}^{t,ι} as the fixed resource placement; let obj_val be the achieved objective value (total utility from tasks)
7:   if obj_val > bound_val then
8:     bound_val = obj_val; count_num = 1
9:     x_{j,i}^t = x_{j,i}^{t,ι}; y_{k,i}^t = y_{k,i}^{t,ι}
either of the following conditions is met: (1) the number of iterations reaches a certain threshold max_itr, or (2) the current objective value (total task utility) has occurred more than a specified number of times max_occur. These two thresholds can be set via experiments; larger threshold values lead to longer iterations but improved results.
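The alternating structure of Algorithm 3 can be captured in a short, problem-agnostic skeleton. This is a sketch, not the dissertation's implementation: `solve_p1` and `solve_p2` are caller-supplied stand-ins for the two integer programs, and the stall-based termination is a simplified reading of the max_occur rule.

```python
def two_stage_optimize(solve_p1, solve_p2, init_dispatch,
                       max_itr=20, max_occur=3):
    """Skeleton of the two-stage method: alternate between the placement
    subproblem P1 (dispatching fixed) and the dispatching subproblem P2
    (placement fixed), keeping the best objective value seen so far.

    solve_p1(y) -> x        : best placement given dispatching y
    solve_p2(x) -> (y, obj) : best dispatching given placement x, plus the
                              achieved objective value (total utility)
    """
    y = init_dispatch                      # feasible initial dispatching y^{t,0}
    best_val, best = float('-inf'), (None, y)
    count = 0
    for _ in range(max_itr):
        x = solve_p1(y)                    # Stage 1: placement given dispatching
        y, obj = solve_p2(x)               # Stage 2: dispatching given placement
        if obj > best_val:
            best_val, best, count = obj, (x, y), 1
        else:
            count += 1                     # objective value reoccurred
        if count >= max_occur:             # stalled: terminate early
            break
    return best, best_val
```

As a toy usage, alternating exact maximization of f(x, y) = −(x − y)² − (y − 3)² over x then y converges toward (3, 3), mirroring how each stage improves the shared objective.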
Figure 3.2: Illustration of joint resource placement and task dispatching across two
timescales.
in the edge network, while the resource placement can be adjusted (such as redeploying or migrating services) less frequently on a slow timescale. Compared with single-timescale methods, multi-timescale solutions [22, 132] can achieve better performance with more flexible management, and have thus gained significant attention recently from the research community.
Our proposed two-stage algorithm can easily be adapted to a two-timescale solution. As illustrated in Fig. 3.2, we can make task dispatching decisions on the fast timescale (at the starting point of each time slot) and resource placement decisions on the slow timescale (at the starting point of each time frame). Here, we assume that each time frame includes χ time slots. More specifically, at the beginning of each time frame, we run our proposed iterative two-stage algorithm (Algorithm 3), and at the beginning of each time slot (except for the first time slot), we only solve the Stage 2 problem (P2) with the resource placement fixed. By doing so, not only can we handle diverse dynamics among workloads and resources, but the running time of the overall algorithm is also reduced, since the iterative algorithm is performed only once per time frame and solving P2 at each time slot is relatively simple. Thus, it leads to greater flexibility with more cost savings.
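The two-timescale control flow can be sketched as a simple scheduling loop. All names here are hypothetical: `joint_solver` stands in for the full iterative algorithm run once per frame, and `dispatch_solver` for re-solving only P2 at each remaining slot.

```python
def run_two_timescale(frames, chi, joint_solver, dispatch_solver):
    """Sketch of the two-timescale scheme: the joint placement+dispatching
    algorithm runs once per time frame (slow timescale); at each of the
    remaining chi-1 time slots, only the dispatching is re-solved with the
    placement x held fixed (fast timescale). Returns one (x, y) pair per slot.
    """
    schedule = []
    for frame in range(frames):
        x, y = joint_solver(frame)            # slow: full iterative algorithm
        schedule.append((x, y))
        for slot in range(1, chi):            # fast: placement x stays fixed
            y = dispatch_solver(frame, slot, x)
            schedule.append((x, y))
    return schedule
```

Each frame thus contributes exactly χ decisions, with one expensive joint solve amortized over χ − 1 cheap dispatch-only solves.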
3.4 Reinforcement Learning based Method
In this section, we consider an alternative method to solve the joint optimization by leveraging the emerging deep reinforcement learning technique. Reinforcement learning (RL) has a great capability to tackle complex optimization problems in a dynamic system. The characteristic of the RL framework is that decisions are made by RL agents, and the feedback generated by the environment is used to improve the agent's decisions. There are three key elements in RL frameworks: state, action, and reward.
Generally, RL algorithms can be classified into value-based and policy-based methods. Value-based RL methods (e.g., Q-learning, Deep Q-Network (DQN) [69], Double DQN [104]) can select and evaluate the optimal value function with lower variance. The value function measures the goodness of a state (state-value) or how good it is to perform an action from a given state (action-value). However, it is difficult for value-based methods to handle continuous action spaces: evaluating the value of an infinite number of actions is prohibitively time-consuming. On the other hand, policy-based methods, such as policy gradient [53], are effective in high-dimensional or continuous action spaces. They can learn stochastic policies and have better convergence properties. The main idea is to determine, at each state, which action to take in order to maximize the reward. The way to achieve this is to find and tune a vector of parameters θ so as to select the best action under policy π, where π gives the probability of taking action a in state s given parameters θ. There are some disadvantages to policy-based methods: (1) they typically converge to a local rather than global optimum; (2) evaluating a policy is typically inefficient and has high variance.
The Actor-Critic RL method [68] combines the basic ideas of value-based and policy-based algorithms. The actor uses policy-based methods to select the action, while the critic uses value-based methods. As shown in Fig. 3.3, the actor takes the state as input and outputs the best action; it essentially controls how the agent behaves by learning the optimal policy (policy-based). The critic, on the other hand, evaluates the action by computing the value function (value-based), and the feedback (such as an error signal) tells the actor how good its action was and how it should adjust. However, since the actor-critic method involves two neural networks,
[Figure 3.3: the actor-critic framework: the policy-based actor network sends actions to the environment, which returns states and feedback.]
the parameters are updated continuously and consecutive updates are strongly correlated; this correlation can cause the neural networks to view the problem one-sidedly, or even fail to learn anything. To avoid this problem, we leverage the Deep Deterministic Policy Gradient (DDPG) RL technique [97, 53] to solve the joint optimization problem.
• cri : available computing resources (e.g., storage, CPU, memory) of each edge
server.
Let SS be the state space, the system state ssι ∈ SS at step ι can be defined as
Action Vector: In terms of the action vector, the agent will make decisions for
both resource placement and task dispatching. The decision mainly consists of where
to place resources and where to dispatch tasks. Therefore, the action vector includes
two parts.
Let A be the action space; the system action a_ι ∈ A at step ι can be defined as
Reward: At each step, the agent gets a reward r_ι from the environment after taking an action a_ι. Generally, the reward function is related to the objective function of the optimization problem. Fortunately, the objective of our optimization problem is to maximize the total utility of all tasks, so the reward of the RL agent is set as follows.
r_ι = Σ_k Σ_i (ρ_k − ς_{k,i}^t) − Σ_j ϱ_j^t.    (3.29)
Notice that the reward r_ι can be obtained given the environment and the agent's action a_ι, which includes the solutions of both resource placement and task dispatching.
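A minimal sketch of how the reward in Eq. (3.29) could be computed, assuming ρ_k is the utility of task k, ς_{k,i}^t the cost of serving task k on edge server i, and ϱ_j^t the download cost of a placed resource j (these interpretations follow the surrounding text; the helper names are our own):

```python
def step_reward(dispatch, rho, serve_cost, placed, download_cost):
    """Reward of Eq. (3.29): total utility of dispatched tasks minus the
    download cost of the resources placed in this step.
    dispatch: {task k: edge server i}; placed: resources j placed this step."""
    task_utility = sum(rho[k] - serve_cost[k][i] for k, i in dispatch.items())
    placement_cost = sum(download_cost[j] for j in placed)
    return task_utility - placement_cost
```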
Figure 3.4: The architecture of the DDPG RL algorithm. The circled numbers are the corresponding steps.
1. Initialize the system and environment based on the edge network G, the set of external resources Q, and the set of tasks U, as well as other network information.
2. Initialize the Actor evaluation network µ(s|θ^µ) and target network µ′(s|θ^{µ′}), as well as the Critic evaluation network Q(s, a|θ^Q) and target network Q′(s, a|θ^{Q′}), where θ^µ and θ^Q are the evaluation network parameters, and θ^{µ′} and θ^{Q′} are the target network parameters.
3. Initialize the replay buffer D, the maximum number of episodes max_ep, and the maximum number of steps per episode max_st. D is used to sample experiences to update the neural network parameters.
4. At the beginning of each episode, initialize the random exploration noise and generate the initial state s_1.
5. For each step ι, the actor selects an action a_ι based on the current policy and random noise.
6. The environment executes action a_ι, returns the reward r_ι, and the new state s_{ι+1} is observed. The transition (s_ι, a_ι, r_ι, s_{ι+1}) is then stored in D. At the same time, the actor sends the action to the critic network.
7. Randomly sample a batch of transitions (s_i, a_i, r_i, s_{i+1}) from D, then calculate the expected value/reward z_i.
8. Update the Critic and Actor evaluation networks with the sampled data.
9. Softly update the target network parameters: θ^{µ′} ← θ^µ and θ^{Q′} ← θ^Q.
10. This process repeats until the maximum number of episodes is reached.
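Steps 3, 7, and 9 above can be sketched as follows. The replay-buffer capacity (10,000) and the soft-replacement rate (0.01) come from Table 3.1; everything else is an illustrative simplification of DDPG, not our exact implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Replay buffer D (step 3): stores transitions (s_i, a_i, r_i, s_{i+1});
    the default capacity of 10,000 matches Table 3.1."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # step 7: randomly sample a batch of stored transitions
        return rng.sample(list(self.buf), min(batch_size, len(self.buf)))

def soft_update(target, source, tau=0.01):
    """Step 9 soft replacement: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1 - tau) * w_t for w, w_t in zip(source, target)]
```

In a full DDPG implementation the same soft update is applied to every weight tensor of both target networks after each gradient step.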
Figure 3.5: Resource placement and task dispatching via deep reinforcement learning across two timescales with two DDPG models.
3.5 Evaluation
This section reports the results from our trace-based simulations to evaluate our
proposed strategies.
Table 3.1: RL Hyperparameters
Parameter Value Parameter Value
Max Episode 100 Reward Discount 0.9
Max Step per Episode 3,000 Batch Size 32
Learning Max Episode 10 Soft Replacement 0.01
Actor Learning rate 0.0001 Replay buffer Capacity 10,000
Critic Learning rate 0.0002
2011 traces) [86]. For the external resources (data and services), we randomly gener-
ate 100 data items and 20 services where the size of each resource is from 10MB to
200MB. To simulate the tasks from mobile users, we leverage the user mobility data
from the CRAWDAD dataset kaist/wibro [70], developed by a Korean team, which collected CBR and VoIP traffic from the WiBro network in Seoul, Korea. We
randomly sample from this dataset to generate the random tasks from mobile users
to perform our simulation. We run our experiments on a DELL Precision 3630 Tower
with an i7-9700 CPU, 16GB RAM, and NVIDIA GeForce RTX 2060 GPU. For our
proposed RL-based method, the detail of hyperparameters configuration is reported
in Table 3.1. The parameters are initialized by the general value that is used in most
RL experiments. We test multiple values for each parameter and select the value
that has better performance. We compare our proposed Two Stage Optimization
(OPT) and Deep Reinforcement Learning (RL) solutions with two baselines: a
random strategy and a greedy strategy.
• Greedy (GRD). It greedily makes resource placement and task dispatching decisions to maximize the total utility in each round, giving priority to resources/tasks based on their popularity/benefits. Specifically, GRD first sorts resources by popularity and processes them starting from the most popular one; in each round, it iteratively selects the edge server that maximizes the total utility for placing the resource. Similarly, for task dispatching, GRD sorts all tasks by their benefits and processes the most beneficial task first; likewise, it iteratively and greedily selects an edge server to dispatch each task to, so as to maximize the task utility in each round.
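The resource-placement half of GRD can be sketched as below; the `utility_gain` callback is a stand-in for evaluating the total-utility objective and is not from the dissertation:

```python
def greedy_placement(resources, servers, popularity, utility_gain):
    """GRD sketch: process resources from most to least popular, placing each
    on the server that maximizes the (assumed) utility-gain callback.
    utility_gain(resource, server, placement_so_far) -> float."""
    placement = {}
    for res in sorted(resources, key=lambda r: popularity[r], reverse=True):
        best = max(servers, key=lambda s: utility_gain(res, s, placement))
        placement[res] = best
    return placement
```

The task-dispatching half follows the same pattern, sorting tasks by benefit instead of popularity.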
(a) number of task requests (b) number of edge servers
Figure 3.6: Overall performance of four methods in one timescale.
We evaluate the performance of all methods based on the average total utility (i.e., the objective function in our formulated optimization problems); obviously, the larger the utility value, the better the resource placement and task dispatching performance. All parameters required to calculate the objective function (such as network topology, bandwidth, task requirements, server capacity, and download cost) are known to all methods as inputs at each time unit. For the RL methods, these parameters are used to calculate the reward at each time unit.
(a) running time (b) total utility per slot
Figure 3.7: Running time and convergence of OPT.
utility of RAND increases in the beginning and then varies less as the number of edge servers increases. Of the other three solutions, OPT and GRD vary a little as the number of servers increases, while RL stays stable throughout. Overall, the performance of most of the solutions is relatively stable, especially RL. In all cases, RL and OPT perform much better than GRD and RAND, which once again confirms the advantage of our two proposed methods.
(a) single vs two timescales (b) dynamic vs static status
Figure 3.8: Performance of OPT across two timescales with dynamic status.
trade-off between the max iteration and the running time as well as the optimization
objective value since more iterations cost more running time.
also suffers from frequent resource placement changes, which might be costly. Third, when the solutions are performed across two timescales, the performance can be further improved. This might be because performing task dispatching at each time slot can find a suitable server to execute the task and quickly release the server for other tasks. Overall, the results from this set show that a multi-timescale solution can achieve better performance than the single-timescale method, which echoes a similar discovery from [22, 132] (though the studied problems and network models are different).
Finally, we evaluate our proposed two-timescale solutions over edge servers with
dynamic status by leveraging the status trace-driven data from the Google Cluster
Data (ClusterData 2011 traces) [86]. We use the trace data to generate the server
status at different time slots. Other parameters are similar to previous experiments.
For two-timescale solutions, we use different combinations of OPT/GRD/RAND
to solve data placement and task dispatching problems respectively. As shown in
Fig. 3.8(b), there are nine combinations in total. For example, OPT+RAND means
the optimization-based method is used for data placement, while task dispatching is
done randomly. Fig. 3.8(b) reports the results of these methods under three different
scenarios: (1) Always On: all edge servers are assumed to be always running and available for serving tasks; (2) Dynamic Status: the status of each edge node varies from time slot to time slot, and when a server is down in a time slot, no task can be dispatched to it; (3) Static Status: our method completely ignores the server status when solving the data placement and task dispatching problems. Obviously, all combinations under dynamic status have lower total utility than under always-on, since some servers may be unavailable in certain time slots. In addition, if the status is ignored, the performance (of static status) is significantly reduced, since dispatched tasks may not be completed when servers become unavailable. Clearly, our solutions, which consider the dynamic status, achieve performance comparable to the case where every server is on. Last, among all nine combinations, using our optimization-based solution for both resource placement and task dispatching across two timescales achieves higher performance than the other combinations. This indirectly illustrates the effectiveness of the two-stage algorithm under two timescales in handling real dynamics in edge computing, which is the major contribution of this work.
(a) single timescale (b) across two timescales
Figure 3.9: Convergence of RL under different timescales.
(a) different batch sizes (32, 64, 128) (b) different learning rates (actor/critic: 0.0001/0.0002, 0.0005/0.001, 0.001/0.002)
Figure 3.10: Convergence of RL under different batch sizes and learning rates.
different learning rates will lead to different convergence results so we have to select
an appropriate learning rate for our RL model.
scheduling in edge computing to reduce the computation delay and response time.
Their formulated optimization considers the value, transmission cost, and replacement
cost of data blocks, which is then solved by a tabu search algorithm. Breitbach et al.
[12] have also studied both data placement and task placement in edge computing by
considering multiple context dimensions. For its data placement part, the proposed
data management scheme adopts context-aware replication, where the parameters of the replication strategy are tuned based on context information (such as data size, remaining storage, stability, and application). Huang et al. [30] have studied caching fairness for data sharing in edge computing environments. They formulate the caching
fairness problem, where fairness metrics take resources and wireless contention into
consideration, and propose both approximation and distributed algorithms. Xie et
al. [123] also studied the data-sharing problem and proposed a coordinate-based data indexing mechanism to enable efficient data sharing in edge computing. It maps
both switches and data indexes into a virtual space with associated coordinates, and
then the index servers are selected for each data based on the virtual coordinates.
Xie et al. [122] further extended their virtual-space method to handle data placement
and retrieval in edge computing with an enhancement based on centroidal Voronoi tessellation to handle load balancing among edge servers. Similarly, Wei et al. [118, 119]
proposed another virtual-space based data placement strategy which takes the data
popularity of data items into consideration during the virtual-space mapping, data
placement and retrieval. There are solutions [65] for data management issues in edge
computing as well.
Similar to data placement, service and resource placement in edge computing has
been studied as well. Ouyang et al. [78] proposed an adaptive user-managed ser-
vice placement algorithm to jointly optimize the latency and service migration cost.
By formulating the service placement problem as a contextual Multi-armed Bandit
problem, they proposed a Thompson-sampling based online learning algorithm to make adaptive service placement decisions. Xu et al. [126] studied service caching in mobile edge clouds with multiple service providers competing for both computation and bandwidth resources, and proposed a distributed and stable
game-theoretical caching mechanism for resource sharing among the network service
providers. Pasteris et al. [79] also studied a multiple-service placement problem in
a heterogeneous edge system and proposed an approximation algorithm for placing
multiple services to maximize the total reward. Meskar and Liang [67] proposed a
resource allocation rule retaining fairness properties among multiple access points,
while Zhang et al. [134] proposed a decentralized multi-provider resource allocation
scheme to maximize the overall benefit of all providers. Resource placement has also
been considered jointly with other design issues in edge networking and computing.
For example, Kim et al. [37] designed a joint optimization of wireless MIMO signal
design and network resource allocation to maximize energy efficiency in wireless D2D
edge computing. Eshraghi and Liang [20] considered the joint optimization of com-
puting/communication resource allocation and offloading decisions of uncertain tasks
in mobile edge networks.
performance. Yang et al. [129] proposed a Benders decomposition-based algorithm
to jointly solve the cloudlet placement and task allocation problem while minimizing
the total energy consumption.
However, most of these works consider joint optimization at a single timescale, and thus may not handle the dynamics among tasks, resources, and computation facilities in the edge computing environment. Recently, Farhadi et al. [22]
studied service placement and request scheduling problems in edge cloud environ-
ments for data-intensive applications and proposed a two-timescale framework to determine the near-optimal decision under specific constraints. You et al. [132] also
studied a joint resource provision and workload distribution problem in a mobile edge
network. They formulated the problem as a nonlinear mixed-integer program to min-
imize the long-term cost, and proposed online learning-based algorithms to solve the
problem in two timescales. Our work is inspired by these works, but we consider a different joint optimization with different network and edge settings. In addition, we also leverage deep reinforcement learning to solve the joint optimization.
computing and network resources to reduce the average service time and balance resource usage under a dynamic edge network. Ning et al. [75] solved the joint task scheduling and resource allocation optimization in a vehicular edge system to maximize users' Quality of Experience (QoE), using a two-sided matching scheme for task scheduling and a DRL approach for resource allocation, respectively. Nath and Wu
[72] considered the computation offloading and resource allocation in a cache-assisted
edge system, and proposed a DDPG-based scheduling policy to minimize the long-
term average cost including energy consumption, total delays and resource accessing
cost. Meanwhile, Rahman et al. [84] also studied the joint problem of mode selec-
tion, resource allocation, and power allocation to minimize the total delay in the fog
radio access networks using DRL methods. While many of these works adopt DRL to successfully optimize task scheduling/offloading and/or resource allocation, they usually use one DRL agent to learn the dynamics. In our work, our DRL method has been extended to work across two timescales.
CHAPTER 4
4.1 Introduction
Mobile users, Internet of Things (IoT) devices, and artificial intelligence applica-
tions generate massive amounts of data today, providing potential training datasets
for a variety of machine learning (ML) tasks. Traditionally, for centralized machine
learning model training, the entire dataset is uploaded to a remote cloud center. How-
ever, due to limited network bandwidth and data privacy concerns, uploading a large
amount of data to a remote data center is not trivial. Edge computing combined
with distributed machine learning is a natural alternative because training data is
generated at the network edge, such as from smart sensing devices and smartphones
connected to the network edge. Nonetheless, there are numerous challenges to train-
ing ML models in the edge cloud. First, due to limited data and computing resources,
a single edge device/server may be incapable of performing a high-quality ML model
training task on its own. Second, edge devices/servers’ computing capacity and net-
work resources are limited and heterogeneous. Different edge units may result in
varying convergence speeds and performances when performing ML training tasks.
Third, edge resources are typically shared by a large number of mobile users or ap-
plications. The shared resources and competition among various users, edge servers,
and applications inevitably constrain distributed ML training within the edge cloud.
To tackle the aforementioned challenges, a new distributed machine learning
paradigm has been proposed, called federated learning (FL) [64, 34, 91] that con-
ducts distributed learning at multiple clients without sharing raw local data among
themselves. Coupled with edge computing, FL over edge cloud has been recently
studied in various settings [54, 56, 108, 76, 58, 35, 66, 113, 116, 74]. In such a sce-
nario, several edge servers have been selected as participants (either parameter servers
or FL workers), and collaboratively train a shared global ML model without sharing
their local dataset and decoupling the ability to do model training from the need to
Figure 4.1: An illustration of multi-model federated learning over a shared edge cloud, where selected parameter servers (PS) and workers on edge servers collaboratively train multiple FL models using local training data.
store data in a centralized server. More precisely, as shown in Fig. 4.1, in each global iteration, edge servers acting as workers first download the latest global model from the parameter server (PS) and then perform a fixed number of local training iterations based on their local data. After that, each edge server uploads its local model to the parameter server, which is responsible for aggregating the parameters from different workers and sending the aggregated global model back to each FL worker. Previous efforts on FL over the edge have focused on convergence and adaptive control [108, 56], resource allocation and model aggregation [58, 113, 66], and communication and energy efficiency [64, 131, 48].
In this chapter, we focus on a joint participant selection and learning optimization problem in multi-model FL over a shared edge cloud¹. For each FL model, we aim
to find one PS and multiple FL workers and decide the local convergence rate for FL
workers. Note that both worker selection and learning rate control have been studied
in FL recently. With heterogeneous resources and capacities at edge devices, when
multiple FL models are trained at the same time, which FL model is preferentially
served at which edge server directly affects the total communication cost and compu-
tational cost of the FL training. The selection of participants (both the PS and FL
workers) for each model will also affect the learning convergence speed. Hence, we
aim to carefully select the FL participants for each FL model and pick the appropri-
¹ As shown in Fig. 4.1, we consider an edge cloud architecture where a set of edge servers are connected to each other, without a remote cloud center, to form an edge network to serve the users.
ate local learning rate for these selected FL workers, so as to minimize the total cost
of FL training of all models while meeting the convergence requirement from each
model.
Figure 4.2: The training process of an FL model within the edge network at different time periods (the model is downloaded, then ϑ_j^t global iterations are processed, each with φ_j^t local updates).
uses a fixed number of workers, and one worker can only perform FL training for one model at a time.
We consider a series of consecutive time periods t = 1, · · · , T , and each time
period has an equal duration τ . As shown in Fig. 4.2, at each time t, we select the FL
participants for each model and then train W models in parallel through FL, which
consists of a number of global iterations (let ϑ_j^t be the number of global iterations of m_j at time t). For each model m_j, each global iteration includes four parts: (1) the selected parameter server initializes the global model of m_j; (2) the selected workers download the global model from the parameter server; (3) each worker runs local updates on its own raw dataset for φ_j^t local iterations to achieve the desired local convergence rate ϱ_j^t; (4) the workers upload the updated model and related gradients to the parameter server for aggregation to update the global model. The process of federated learning at different time periods is shown in Fig. 4.2.
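The four parts of a global iteration can be sketched as below; the unweighted averaging aggregator is only illustrative (the actual aggregation, weighted by the worker selection variables, is defined later in this section):

```python
def fedavg(models):
    """Unweighted parameter averaging (illustrative aggregator only)."""
    n = len(models)
    return [sum(p) / n for p in zip(*models)]

def global_iteration(global_model, workers, local_update, aggregate):
    """One global iteration of model m_j: (2) each worker downloads the
    global model, (3) runs its local updates, (4) the PS aggregates."""
    local_models = [local_update(w, list(global_model)) for w in workers]
    return aggregate(local_models)
```

Running this ϑ_j^t times per time period, with `local_update` performing φ_j^t gradient steps, mirrors the training process in Fig. 4.2.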
Next, we define our local training and global aggregation process as well as the
loss function during the federated edge learning at each time period.
Loss Function: Let all the sample data used by the jth model and stored in edge server v_i be defined as D_{j,i}^t = ⋃_{w_{j,k} o_{i,k} = 1} S_{i,k}. For each data sample d = ⟨q_d, r_d⟩ ∈ D_{j,i}^t, where q_d is the input data and r_d is the output data/label, we define the average loss of data for the jth FL model on server v_i in time period t as A_{j,i}^t(p):

A_{j,i}^t(p) = (1 / |D_{j,i}^t|) Σ_{d ∈ D_{j,i}^t} H(I(q_d; p), r_d),

where H(·) is the loss function measuring the performance of the training model, I(·) is the training model, and p is the model parameter.
Then the average loss of data for the jth FL model on all related edge servers in time period t is defined as follows:

A_j^t(p) = Σ_i (|D_{j,i}^t| / |D_j^t|) A_{j,i}^t(p),

where D_j^t is the union of all involved training samples of model j at time t.
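These two averages translate directly into code; here `model` and `loss` stand in for I(·; p) and H(·, ·) (a sketch, not the training code used in the experiments):

```python
def avg_loss_on_server(samples, model, loss):
    """A_{j,i}^t(p): mean loss over server v_i's samples (q_d, r_d)."""
    return sum(loss(model(q), r) for q, r in samples) / len(samples)

def avg_loss_all_servers(datasets, model, loss):
    """A_j^t(p): per-server losses weighted by |D_{j,i}^t| / |D_j^t|."""
    total = sum(len(d) for d in datasets)
    return sum(len(d) / total * avg_loss_on_server(d, model, loss)
               for d in datasets)
```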
Local Training on FL Workers: For each global iteration α ∈ [1, ϑ_j^t] of the jth FL model, the related edge server v_i (FL worker) performs the following local update process:

p_{j,i}^{t,α} = p_j^{t,α−1} + ω_{j,i}^{t,α},

where p_{j,i}^{t,α} is the local model parameter on edge server v_i in the current iteration and p_j^{t,α−1} is the aggregated model downloaded from the parameter server in the last iteration, with p_j^{t,0} = p_j^{t−1,ϑ_j^{t−1}}. ω_{j,i}^{t,α} is the local update from a gradient-based method and can be calculated as follows:

ω_{j,i}^{t,α} = Σ_{β=1}^{φ_j^t} ω_{j,i}^{t,α,β} = Σ_{β=1}^{φ_j^t} {ω_{j,i}^{t,α,β−1} − δ ∇L_{j,i}^{t,α}(ω_{j,i}^{t,α,β−1})},

where ω_{j,i}^{t,α,β} is the model parameter of the jth FL model at the β-th local update and δ is the step size of the local update. Lastly, L_{j,i}^{t,α}(·) is the predefined local loss function:

L_{j,i}^{t,α}(ω) = A_{j,i}^t(p_j^{t,α−1} + ω) − {∇A_{j,i}^t(p_j^{t,α−1}) − ξ_1 J_j^t(p_j^{t,α−1})}^⊤ ω + (ξ_2 / 2) ||ω||²,

J_j^t(p_j^{t,α}) = Σ_i ∇A_{j,i}^t(p_j^{t,α}) / Σ_i y_{i,j}^t,

where ξ_1 and ξ_2 are two constants. J_j^t(·) aggregates the gradients among all related edge servers, and this process is performed in the global aggregation step.
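The β-indexed local update reduces to plain gradient descent on the surrogate loss L_{j,i}^{t,α}; a minimal sketch with a generic gradient callback (standing in for ∇L_{j,i}^{t,α}, not the exact surrogate above):

```python
def local_updates(omega0, grad_L, delta, num_iters):
    """Run phi local gradient steps:
    omega^beta = omega^{beta-1} - delta * grad_L(omega^{beta-1})."""
    omega = list(omega0)
    for _ in range(num_iters):
        g = grad_L(omega)
        omega = [w - delta * gi for w, gi in zip(omega, g)]
    return omega
```

Starting from ω_{j,i}^{t,0} = 0 (as allowed by the text) and iterating φ_j^t times yields the worker's local update for one global iteration.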
Assume that A_{j,i}^t(·) is λ-Lipschitz continuous and γ-strongly convex [14, 131]; then the local convergence of the local model is expressed as

L_{j,i}^{t,α}(ω_{j,i}^{t,φ_j^t}) − L_{j,i}^{t,∗} ≤ ϱ_j^t [L_{j,i}^{t,α}(ω_{j,i}^{t,0}) − L_{j,i}^{t,∗}],    (4.1)

where L_{j,i}^{t,∗} is the local optimum of the training model. Furthermore, we can set ω_{j,i}^{t,0} = 0, since the initial value for the training model can start from 0.
Global Aggregation on the Parameter Server: After the local updates, all related FL workers upload their local model parameters ω_{j,i}^{t,α} and the related gradients ∇A_{j,i}^t(p_j^{t,α}) to the parameter server for aggregation:

p_j^{t,α} = p_j^{t,α−1} + Σ_i {y_{i,j}^t ω_{j,i}^{t,α}} / Σ_i y_{i,j}^t.
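This weighted aggregation can be sketched as follows, where y_i plays the role of the worker-selection indicator y_{i,j}^t (the function name is ours):

```python
def aggregate_global(p_prev, omegas, y):
    """p_j^{t,alpha} = p_j^{t,alpha-1} + sum_i y_i * omega_i / sum_i y_i,
    where y_i = 1 if server v_i is a selected worker for model m_j, else 0."""
    total = sum(y)
    return [p + sum(y_i * o[k] for y_i, o in zip(y, omegas)) / total
            for k, p in enumerate(p_prev)]
```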
parameter and γ is the strong-convexity parameter. The values of both λ and γ are determined by the loss function. ϑ_0 and φ_0 are two constants, where ϑ_0 = 2λ² / (γ²ξ_1) and φ_0 = 2 / ((2 − λδ)δγ).
4.3 Joint Participant Selection and Learning Optimization Problem
m_j at time t. We will use ϱ_j^t and ς_j to control the number of global iterations and local updates for model m_j at time t. Recall that ς_j is given by model m_j as a requirement; thus only ϱ_j^t is used for optimization. Overall, x_{i,j}^t, y_{i,j}^t, and ϱ_j^t are the decision variables of our optimization in each time period t.
We now formulate our participant selection problem in multi-model FL, where we need to select the parameter server and workers for each model as well as achieve the desired local convergence rate. The objective of our problem is to minimize the total cost of all FL models at time t under specific constraints:
min Σ_{j=1}^{W} ϖ_j^t    (4.3)
Here, ϖ_j^t is the total FL cost of the jth FL model at time t, which will be defined in the next subsection. Constraints (4.4) and (4.5) make sure that the storage and CPU capacities satisfy the FL model requirements. Constraint (4.6) ensures that an edge server stores the dataset that matches the FL model. Constraint (4.7) guarantees that the numbers of parameter servers and FL workers of each model are 1 and κ_j, respectively. Constraint (4.8) ensures that each edge server trains only one FL model and can only play one role at a time. The decision variables and their ranges are given in (4.9). With a nonlinear learning cost, this formulated optimization is a mixed-integer nonlinear program (MINLP), which is challenging to solve directly.
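To make the combinatorial structure concrete, a brute-force search over PS/worker choices for a single model (with the continuous rate ϱ_j^t held fixed) could look like the following. This is only viable for tiny instances and is not one of our proposed algorithms; the `cost` callback stands in for ϖ_j^t:

```python
from itertools import combinations

def brute_force_selection(servers, num_workers, cost):
    """Exhaustively enumerate (PS, worker set) pairs for one FL model and
    return the cheapest; cost(ps, workers) stands in for the FL cost."""
    best, best_cost = None, float("inf")
    for ps in servers:
        rest = [s for s in servers if s != ps]
        for workers in combinations(rest, num_workers):
            c = cost(ps, workers)
            if c < best_cost:
                best, best_cost = (ps, workers), c
    return best, best_cost
```

With N servers and κ_j workers per model the search space is already O(N · C(N−1, κ_j)) for a single model, which motivates the decomposition-based methods below.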
Figure 4.3: The problem decomposition and design of our proposed multi-stage algorithms.
If the FL model is being trained for the first time or was not updated in the last time period, the selected parameter server has to download the model m_j with cost η_j. If the parameter server stays the same as in the last time period, there is no cost. Otherwise, the new parameter server needs to either download the model or transfer the model from the previous server. The total cost of the jth FL model at time t is then given by
4.4 Our Proposed Methods
Three-Stage Decomposition
Algorithm 4 Three-Stage Optimization Method
1: Initialize max_itr, max_occur, bound_val
2: Generate a random initial FL worker selection decision y_{i,j}^{t,0} and local convergence rate ϱ_j^{t,0}
Three-Stage Methods
After we decompose the original problem into three sub-problems, we can solve each sub-problem using either linear programming techniques or greedy heuristics. The basic idea shared by these methods is as follows. First, we randomly generate the FL worker selection decision y_{i,j}^{t,0} and the local convergence rate ϱ_j^{t,0}, then solve the
Algorithm 5 Three-Stage Greedy Method
1: Initialize max_itr, max_occur, bound_val
2: Generate a random initial FL worker selection decision y_{i,j}^{t,0} and local convergence rate ϱ_j^{t,0}
6: Stage 2: Calculate the total cost of each potential edge server for each FL model, sort the list in ascending order, and greedily select the first κ_j edge servers to obtain y_{i,j}^{t,ι} with the latest x_{i,j}^{t,ι} and fixed ϱ_j^{t,ι−1}
Note that during the first two stages of Algorithm 5, we need to select the PS or workers for all models in a certain order. Obviously, the processing order of the models may affect the final performance. By default, we simply process them in a first-come-first-served mode, i.e., we first find the solution for the model that arrives earlier. Due to the heterogeneity of edge servers in a real edge cloud, some edge servers may have more sufficient resources (storage and computing capacity) while others do not. In such a resource-limited scenario, serving the more complex FL models first may reduce the total completion cost of FL across all models. Therefore, we also introduce a greedy variation in which the FL models are sorted by model size and processed largest first in both the first and second stages of Algorithm 5. In this variation, the more complex FL models get the first chance to select high-performance workers, leading to a lower total cost. In our experiments, we evaluate the impact of these two processing orders. In addition, other ordering methods can also be applied to our proposed method, such as choosing the model that requests more resources first.
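The two processing orders can be sketched as a small helper (the function and policy names are illustrative):

```python
def processing_order(models, sizes, policy="fcfs"):
    """Order in which models are served in stages 1-2: 'fcfs' keeps arrival
    order; 'largest_first' serves bigger (more complex) models first."""
    if policy == "largest_first":
        return sorted(models, key=lambda m: sizes[m], reverse=True)
    return list(models)
```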
Algorithm 6 Two-Stage Optimization Method
1: Initialize max_itr, max_occur, bound_val
2: Generate a random initial local convergence rate ϱ_j^{t,0}
Note that in Algorithm 5, the time complexities of Stage 1 and Stage 2 are bounded by O(N · T_cost · W) and O((1/ϵ) · T_cost · W), respectively.
Table 4.1: Parameters Setting for Edge Cloud and FL
Parameter Value or Range
Edge Cloud Parameter
# of edge servers N 20 ∼ 40
vi ’s storage capacity ci 512 ∼ 1, 024GB
vi ’s CPU frequency fi 2 ∼ 5GHz
ei ’s link bandwidth bi 512 ∼ 1, 024Mbps
# of different dataset O 5
each dataset size |Si,k | 1 ∼ 3GB
# of time period T 30
Federated Learning Parameter
# of FL models W 1∼5
# of mj ’s FL workers κj 1∼7
mj ’s model size µj 10 ∼ 100MB
mj ’s CPU requirement χj 1 ∼ 3GHz
mj ’s downloading cost ηj 1∼5
mj ’s global convergence reqs. ςj 0.001 ∼ 0.1
constant FL variables ϑ0 and φ0 15, 4
3 ∼ 7. Each FL task has a specific model size µj , CPU requirement χj , and download
cost ηj in range 10 ∼ 100MB, 1 ∼ 3GHz, and 1 ∼ 5, respectively. The global conver-
gence requirement and the two constants are set based on [35]: ς_j = 0.001, ϑ_0 = 15, and φ_0 = 4. Three classical datasets in scikit-learn 1.0.2 [81] are used to train
linear regression (LR) models: California Housing dataset, Diabetes dataset, and ran-
domly generated LR datasets. Each LR model is trained with the loss of Mean Square
Error (MSE). In addition, we are interested in the performance of the proposed meth-
ods in non-convex loss functions. Hence, three different types of datasets are used
for these FL tasks: Fashion-MNIST (FMNIST) [120], Speech Commands [114], and AG NEWS [138]. Each of them is trained with a CNN model.
We assign random data samples of these three datasets to clients in such a way
that each client has a different number of training and testing data. The Python
library PyTorch (v1.10) is used to build the models. All experiments are run on a Linux workstation with 16 CPU cores, 512GB of RAM, and 4 NVIDIA Tesla V100 GPUs interconnected with NVLink2. Detailed parameters of both the edge cloud and the FL models are listed in Table 4.1.
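A sketch of such a random, unequal assignment of sample indices to clients follows (our exact assignment procedure may differ; the cut-point scheme is an illustrative assumption):

```python
import random

def partition_samples(num_samples, num_clients, rng=None):
    """Shuffle sample indices and split them at random cut points, so each
    client receives a different-sized random share of the data."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    idx = list(range(num_samples))
    rng.shuffle(idx)
    # distinct cut points guarantee every client gets at least one sample
    cuts = sorted(rng.sample(range(1, num_samples), num_clients - 1))
    return [idx[a:b] for a, b in zip([0] + cuts, cuts + [num_samples])]
```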
Baselines and Metrics: We compare our proposed algorithms (three-stage opti-
mization (THSO), three-stage greedy (GRDY) and two-stage optimization (TWSO))
with four competitive methods:
• ROUND[35]: It selects the FL workers and the local convergence rate for each
model based on a randomized rounding method [35]. Since it does not consider
the PS selection, we use a random choice for PS at the beginning.
• LOCAL [51]: It selects the top workers that are expected to complete the local training first (based on estimation). Again, random decisions are used for the PS and the local rate.
(c) detailed costs (d) single vs multiple models
Figure 4.5: Performance comparison with different metrics.
the average total learning cost. The better performance than ROUND (which focuses on worker selection and learning rate optimization) confirms the advantage of our method in considering PS selection in the joint optimization. The better performance of our methods and ROUND over DATA and LOCAL (which only focus on worker selection) shows the advantage of joint optimization. In all simulations, RAND has the worst performance since it does not perform any optimization.
Second, as shown in Fig. 4.5(a), the average total cost of every algorithm first decreases and then increases as the number of edge servers increases. Initially, more edge servers provide better chances to find a good solution that minimizes the total cost of all FL models. However, a further enlarged topology with more servers may begin to increase the average total cost due to larger transmission costs from workers to the PS.
Third, as shown in Fig. 4.5(b), as the global convergence rate increases, the average total cost decreases. This is reasonable since a larger global convergence rate requires less local training and fewer global updates, which leads to a lower total learning cost.
Fig. 4.5(c) also plots the detailed costs of different methods when 30 edge servers are
Figure 4.6: Impact of the number of FL models on costs, including (c) the local update cost and (d) the global update cost of each method (TWSO, THSO, ROUND, GRDY, RAND, DATA, LOCAL) vs. the number of FL models.
considered and the global convergence rate is 0.001. It shows that the local cost dom-
inates the total cost, and consequently, GRDY has a higher total cost than TWSO
and THSO as seen in Fig. 4.5(a).
We also evaluate the effect of joint optimization over multiple models compared with separate optimization of a single model. In the latter case, we still use TWSO and THSO but restrict them to a single FL model at a time, and thus sequentially choose the decisions for each model. Again, we train 3 FL models when 30
edge servers are considered and the global convergence rate is 0.001. Fig. 4.5(d) shows
the comparison of determining the choices for three FL models jointly or sequentially
with TWSO and THSO. We can clearly see the lower total cost when we jointly
optimize the decisions. This confirms the effectiveness of jointly determining the
selection decision for multiple FL models rather than sequentially determining the
decision for every model.
Figure 4.8: Comparison of two different processing orders of FL models: total cost of GRDY and GRDY-Max vs. the number of max iterations.
Recall that in GRDY (Algorithm 5) we need to select the PS and workers for each model following a certain processing order among the FL models. We now study the impact of different processing orders in GRDY. We test two specific processing orders: the default first-in-first-serve order (GRDY) and a variation in which priority is given to the model with a larger size (GRDY-Max). The experiments run under the edge cloud with 30 edge servers that have limited resources and significant differences among them. We run 20 different cases for each number of max iterations, and Fig. 4.8 shows the experimental results. First, as the number of max iterations increases, the total cost of both greedy algorithms decreases since they have more chances to find a better solution with a lower cost. However, the improvement becomes smaller
Figure 4.9: Training loss with LR models/tasks and the impact of FL workers: (a) R2 score of the 3 LR models vs. the number of workers (6-10), (b) training loss over CA Housing, (c) training loss over Diabetes, and (d) training loss over the random dataset, each for κj = 6 to 10 over 100 iterations.
when the max iteration further increases. Second, under the resource-limited scenario,
GRDY-Max performs better than GRDY in almost all cases. This result confirms
the necessity and superiority of selecting an optimal processing order in different edge
scenarios. In addition, we need to select an appropriate max iteration to control the
convergence speed of our greedy algorithms.
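The only difference between the two orders is the sort applied before the greedy pass. A minimal sketch (the model dictionaries and the `size_mb` field are hypothetical stand-ins for the full per-model parameters):

```python
def processing_order(models, largest_first=True):
    """Order FL models for the greedy pass: GRDY serves models in arrival
    (first-in-first-serve) order, while GRDY-Max gives priority to models
    with a larger size."""
    if largest_first:
        return sorted(models, key=lambda m: m["size_mb"], reverse=True)
    return list(models)  # arrival order

models = [{"id": 1, "size_mb": 30}, {"id": 2, "size_mb": 90}, {"id": 3, "size_mb": 55}]
print([m["id"] for m in processing_order(models)])         # GRDY-Max → [2, 3, 1]
print([m["id"] for m in processing_order(models, False)])  # GRDY → [1, 2, 3]
```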
Fig. 4.9 shows the training loss of our method in real-world federated learning
experiments over LR datasets. We introduce the R2 score metric to evaluate the performance of LR model (convex) training. The R2 score is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In this set of experiments, we concurrently train 3 LR models on 3 different datasets. Each dataset is split unequally across 10 edge servers (i.e., a non-IID setting) and the number of global training rounds is 100. We can see from Fig. 4.9(b)-(d) that the training loss
decreases as the number of workers (κj ) increases for each model. Fig. 4.9(a) shows
Figure 4.10: Training accuracy with three FL tasks and the impact of FL workers: (a) training accuracy of the 3 FL models, (b) accuracy on FMNIST, (c) accuracy on Speech Commands, and (d) accuracy on AG News, each for κj = 3 to 7 over 300 iterations.
the R2 score of all LR models. With more workers, the R2 score of all models increases, which means all models are well-regressed. However, model 2 has a worse R2 score (a negative value) with fewer workers due to the small size of its training dataset. As the number of workers increases, the performance of model 2 improves.
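The R2 scores reported above follow directly from the definition. The hand-rolled sketch below (not the scikit-learn routine used in the experiments) also shows how the score turns negative when a model predicts worse than the mean:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination R2 = 1 - SS_res / SS_tot; it is negative
    when the model predicts worse than simply predicting the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

print(r2_score([1.0, 2.0, 3.0], [1.1, 1.9, 3.0]))  # close fit: score near 1
print(r2_score([1.0, 2.0, 3.0], [3.0, 3.0, 3.0]))  # worse than the mean: negative
```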
Fig. 4.10 also reports the learning accuracy of our method on more complex FL tasks with different numbers of workers (due to space limitations, we only show the results from THSO). Here, the datasets of the three FL models (image classification, speech recognition, text classification) are split into 30 partitions and the number of global update rounds is set to 300. Fig. 4.10(a) shows that the training accuracy of all three FL models increases with the number of iterations. Fig. 4.10(b)-(d) show the detailed training accuracy of the three models with different numbers of FL workers. We observe that with more FL workers, the training accuracy of all models reaches a higher value. However, comparing with the results in Fig. 4.7, more FL workers also consume a higher total cost. Hence, there is a trade-off between
the training accuracy and the total cost. Another interesting observation is that for FMNIST and Speech Commands the accuracy increases with more FL workers, but for AG News the accuracy is similar and the differences are minimal. This may be due to the simplicity of the AG News learning task. In summary, one needs to consider the trade-off between training accuracy and total cost: using more FL workers incurs a higher total cost but achieves higher training accuracy, and vice versa.
Recently, Nguyen et al. [74] studied resource sharing among multiple FL services/models in edge computing, where user equipment is used as FL workers, and proposed a solution to optimally manage resource allocation and learning parameter control while ensuring energy consumption requirements. However, their FL framework is different from ours. First, they use user equipment as FL workers, while we use edge servers. Second, they do not consider the PS selection, since they use a single edge server as the PS. Third, their model allows multiple FL models to be trained on the same user equipment (while we do not allow an edge server to act as a worker for multiple models in the same time unit), and thus their method has to manage the CPU and bandwidth allocation on the equipment.
Both [56] and [58] considered a client-edge-cloud hierarchical federated learning
(HFL) where cloud and edge servers work as two-tier parameter servers to aggregate
the partial models from mobile clients (i.e. FL workers). Liu et al. [56] proved the
convergence of such an HFL, while Luo et al. [58] also studied a joint resource alloca-
tion and edge association problem for device users under such an HFL framework to
achieve global cost minimization. Wang et al. [113] considered the cluster structure
formation in HFL where edge servers are clustered for model aggregation. Recently,
Wei et al. [116] also studied the participant selection for HFL in edge clouds to
minimize the learning cost. However, our FL framework does not use HFL.
Meng et al. [66] focused on model training for DFL using decentralized P2P methods in edge computing. While their method also selects FL workers from an edge network, the model aggregation is performed at edge devices based on a dynamically formed P2P topology (no PS). Therefore, it differs from our studied problem, which mainly focuses on CFL.
There are also other works [101, 131, 48] where energy efficiency and/or wireless
communication have been taken into consideration in FL in edge systems.
to minimize the total FL cost of multiple models while ensuring their convergence performance. We propose three different algorithms that decompose the original problem into multiple stages so that each stage can be solved by an optimization solver or a greedy algorithm. Extensive simulations with real FL experiments show that our proposed algorithms outperform similar existing solutions.
CHAPTER 5
QUANTUM-ASSISTED SCHEDULING
ALGORITHMS
5.1 Introduction
With the advancement of technology, quantum computing (QC) has gained much
attention due to the realization of speedups offered by quantum techniques for com-
plex computational problems. This has resulted in transformational breakthroughs
on specific tasks accomplished with near-term quantum computers. QC has more
computational power than classical computers and may be faster at solving com-
plex optimization problems, e.g., random quantum circuit sampling [6], Gaussian
boson sampling [143], and combinatorial optimization [77, 89, 5]. In this chapter, by leveraging the parallel computing capability of QC, we focus on designing a new quantum-inspired scheduling algorithm to solve a complex joint participant selection and learning scheduling problem for federated learning (FL) in distributed networks.
FL is a distributed artificial intelligence (AI) approach that allows for the training
of high-quality AI models by aggregating local updates from multiple FL clients (or
workers), such as IoT devices, without direct access to the local data [64, 33, 91, 116].
This potentially prevents the disclosure of sensitive user information and preferences,
reducing the risk of privacy leakage. Nevertheless, when deploying the FL framework
in distributed networks, there are two challenges. First, the computing power and
network resources of servers, as well as their data distribution, are diverse. Some
low-performance servers may cause the convergence process to slow down and reduce
training performance. Furthermore, dispersed computing resources and high network
latency may result in high training costs. Second, in the practical scenario, concur-
rently training multiple models in the shared distributed network creates competition
for computing and communication resources. As shown in Fig. 5.1, two FL models are
trained concurrently and each FL model requires one PS and three workers for model
training. In this case, which FL model is preferentially served at which server directly
affects the total training cost of all FL models. To this end, appropriate participant
Figure 5.1: Multi-model FL training in a distributed network: two FL models are trained concurrently, each with one PS and three workers, with model broadcasting and global aggregation between each PS and its workers.
selection and learning scheduling decisions are crucial for the multi-model FL training scenario.
As a result, we concentrate primarily on the problem of joint participant selection
and learning scheduling in multi-model FL training scenarios. It should be noted
that in distributed networks, any server can serve as both a PS and a client, and
that participant selection includes selecting both the PS and clients for each FL
model. For clarity, we refer to a client as an FL worker. It is worth noting that both
participant (client) selection and learning scheduling problems have been studied in
FL using classical computers recently [76, 35, 108]. However, most existing works
focus on optimizing a single global FL model rather than multiple FL models. More
importantly, none of these works take into account the PS selection for multiple FL
models. Recently, Wei et al. [117] considered a joint participant selection and learning
scheduling problem in multi-model federated edge learning, and proposed multi-stage
methods to solve the joint optimization problem. However, due to the nature of
the formulated optimization as a mixed-integer non-linear program (MINLP), the
proposed methods may not lead to optimal solutions and do not scale well when the
problem grows more complex.
To address the aforementioned issue, quantum computing has recently emerged
as a powerful optimization tool [77, 89, 5]. Such approaches, however, may not be
competitive until the shortcomings of QC, such as the limited number of qubits, are overcome by further technological advancements. To that end, several hybrid quantum-classical solutions [102, 3] have been proposed to tackle optimization problems by leveraging the complementary strengths of quantum and classical computers. Inspired by these pioneering efforts, we attempt to solve our joint participant selection and learning scheduling problem with a hybrid quantum-classical optimization approach combined with decomposition techniques. Such an approach enables us to fully utilize the capabilities of both quantum and classical computers. In addition, D-Wave stands out in the quantum computing market because it offers the quantum annealer with the most qubits among all candidates. With D-Wave's quantum annealer, one can solve an integer linear programming (ILP) problem by converting it into a quadratic unconstrained binary optimization (QUBO) model, which is inspired by the Ising model. As a result, we develop novel hybrid quantum-classical algorithms on D-Wave's quantum computer.
Three research challenges exist in developing efficient hybrid quantum-classical techniques with decomposition schemes. First, how do we convert our original MINLP problem into an ILP problem and further into a QUBO model that can serve as input to D-Wave's quantum computer? Second, how do we design a novel hybrid quantum-classical strategy that solves the corresponding problem in fewer iterations? Third, how do we derive an efficient number of integer cuts that iteratively reduce the search space and accelerate the convergence of the hybrid quantum-classical methods? To handle these challenges, we develop two novel hybrid quantum-classical algorithms that demonstrate the potential of such hybrid approaches.
storage capacity sci and CPU frequency sfi while each link ej has an available bandwidth bj . Each server holds a distinct set of datasets and can be used for local model train-
ing. We assume that each server can hold multiple types of datasets for FL training
and the dataset used by the j-th FL model in the i-th server is denoted by Di,j . In
this paper, we focus on the participant selection based on computing/communication
resources in the distributed network and do not consider training data distributions
(which is another important research topic and orthogonal to our research).
We further assume that each server can only play a role as either the PS or the worker
for any FL model at one time.
The training process of each FL model includes three stages: (a) initializing and
broadcasting the global model of mj to each participant; (b) each worker performs the
local model computation using its own dataset; and (c) aggregating the local models
from workers, as illustrated in Fig. 4.2 and detailed in the sequel.
Stage 1: Global Model Initialization. In Stage 1, we initialize the global model
parameter for each FL model as ωj and send the global model parameter to each
selected participant.
Stage 2: Local Model Computation. Let the local model parameters of model
mj on the server vi be ωi,j and the loss function on a training data sample s be
fi,j (ωi,j , dxs , dys ), where dxs is the input feature and dys is the required label. Then
the loss function on the whole local dataset of vi is defined as
F_{i,j}(\omega_{i,j}) = \frac{1}{|D_{i,j}|} \sum_{s \in D_{i,j}} f_{i,j}(\omega_{i,j}, d_s^x, d_s^y). \quad (5.1)
Generally, FL will perform round by round, and we denote the total numbers of global aggregations and local updates as α̂ and β̂, where α and β are their indexes, respectively. In the α-th round, each worker runs a number of local updates to achieve a local convergence accuracy ϱj ∈ (0, 1). At the β-th local iteration, each worker follows the same (gradient-based) local update rule
\omega_{i,j}^{\alpha,\beta+1} = \omega_{i,j}^{\alpha,\beta} - \eta \nabla F_{i,j}(\omega_{i,j}^{\alpha,\beta}), \quad (5.2)
where η is the learning rate of the loss function. This process will run until
F_{i,j}(\omega_{i,j}^{\alpha,\hat\beta}) - F_{i,j}^* \le \varrho_j \big[ F_{i,j}(\omega_{i,j}^{\alpha,0}) - F_{i,j}^* \big]. \quad (5.3)
Here, we set \omega_{i,j}^{\alpha,0} = \omega_j.
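Stage 2 can be sketched as plain gradient descent on the local loss until condition (5.3) holds. The one-dimensional toy below (a quadratic loss with known optimum, assuming a gradient-step update rule) illustrates how a tighter ϱj demands more local steps:

```python
def local_updates(omega0, grad, loss, rho, eta, max_steps=1000):
    """Run gradient-step local updates omega <- omega - eta * grad(omega) until
    the remaining loss gap is a fraction rho of the initial gap, as in (5.3).
    (Assumes the toy objective below has optimum F* = 0.)"""
    f_star = 0.0
    gap0 = loss(omega0) - f_star
    omega, steps = omega0, 0
    while loss(omega) - f_star > rho * gap0 and steps < max_steps:
        omega -= eta * grad(omega)
        steps += 1
    return omega, steps

# Toy local loss F(w) = (w - 3)^2 with minimum F* = 0 at w = 3.
loss = lambda w: (w - 3.0) ** 2
grad = lambda w: 2.0 * (w - 3.0)
for rho in (0.5, 0.1, 0.01):   # tighter local accuracy -> more local steps
    print(rho, local_updates(0.0, grad, loss, rho, eta=0.1)[1])
```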
Stage 3: Global Aggregation. At this stage, one participant has to be chosen as the PS. After β̂ local updates, all workers send their local model parameters \omega_{i,j}^{\alpha,\hat\beta} to the PS. The PS performs FedAvg to aggregate the global model parameters as
\omega_j^{\alpha} = \sum_{i \in S_j} \frac{|D_{i,j}|}{|D_j|} \omega_{i,j}^{\alpha-1,\hat\beta}, \quad (5.4)
where D_j = \bigcup_{i \in S_j} D_{i,j} is the set of all data samples from the κj workers and Sj is the set of selected workers. The global convergence of the global model is defined analogously with a global convergence requirement ςj ∈ (0, 1). Then we have the following relationship between the convergence rate and the numbers of local and global iterations [35, 36, 14, 131, 60, 92]:
\vartheta_j \ge \frac{2\lambda^2}{\gamma^2 \xi} \ln\frac{1}{\varsigma_j} \cdot \frac{1}{1-\varrho_j} \triangleq \vartheta_0 \ln\frac{1}{\varsigma_j} \cdot \frac{1}{1-\varrho_j}, \quad (5.6)
\varphi_j \ge \frac{2}{(2-\lambda\delta)\delta\gamma} \log_2\frac{1}{\varrho_j} \triangleq \varphi_0 \log_2\frac{1}{\varrho_j}, \quad (5.7)
where ξ and δ are two variables in the ranges (0, γ/λ] and (0, 2/L), respectively. λ is the λ-Lipschitz parameter and γ is the γ-strongly convex parameter; both are determined by the loss function. ϑ0 and φ0 are two constants, where \vartheta_0 = \frac{2\lambda^2}{\gamma^2\xi} and \varphi_0 = \frac{2}{(2-\lambda\delta)\delta\gamma}, respectively. In addition, x_{i,j} and y_{i,j} are the decision variables on whether to select server v_i as a parameter server or an FL worker for the j-th FL model.
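Bounds (5.6) and (5.7) are cheap to evaluate numerically. The sketch below (using the constants ϑ0 = 15, φ0 = 4, and ςj = 0.001 from the Chapter 4 setup) makes the trade-off explicit: a looser local accuracy means fewer local updates per round but more global rounds:

```python
import math

def global_rounds(rho, varsigma, theta0):
    """Lower bound (5.6) on global aggregations: theta0 * ln(1/varsigma) / (1 - rho)."""
    return theta0 * math.log(1.0 / varsigma) / (1.0 - rho)

def local_rounds(rho, phi0):
    """Lower bound (5.7) on local updates per round: phi0 * log2(1/rho)."""
    return phi0 * math.log2(1.0 / rho)

# A looser local accuracy (larger rho) needs fewer local updates per round
# but more global rounds -- the trade-off the joint optimization balances.
for rho in (0.1, 0.5, 0.9):
    print(rho, round(global_rounds(rho, 0.001, 15), 1), round(local_rounds(rho, 4), 2))
```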
Local Update Cost: Let ψ(·) be the function defining the CPU cycles needed to process the sample data D_{j,i} used by the j-th FL model and stored in server v_i. The overall local update cost for the j-th FL model is then C_j^{local} = \vartheta_j \cdot \varphi_j \cdot \sum_{i=1}^{N} y_{i,j} \cdot \frac{\psi(D_{j,i})}{s_i^f}.
Global Aggregation Cost: Similarly, the global aggregation cost for the uploaded FL models is C_j^{global} = \vartheta_j \cdot \sum_{i=1}^{N} x_{i,j} \cdot \frac{\psi(\mu_j)}{s_i^f}.
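Both cost terms translate directly into code. A sketch with hypothetical server frequencies and cycle counts:

```python
def local_update_cost(theta_j, phi_j, cycles, freqs, y):
    """C_j^local = theta_j * phi_j * sum_i y_{i,j} * psi(D_{j,i}) / s_i^f."""
    return theta_j * phi_j * sum(yi * c / f for yi, c, f in zip(y, cycles, freqs))

def global_agg_cost(theta_j, model_cycles, freqs, x):
    """C_j^global = theta_j * sum_i x_{i,j} * psi(mu_j) / s_i^f."""
    return theta_j * sum(xi * model_cycles / f for xi, f in zip(x, freqs))

# Three servers; server 0 acts as the PS, servers 1 and 2 as workers.
freqs = [2.0e9, 1.5e9, 3.0e9]    # CPU frequencies s_i^f (Hz)
cycles = [0.0, 3.0e9, 6.0e9]     # psi(D_{j,i}): cycles to process each local dataset
print(local_update_cost(10, 4, cycles, freqs, [0, 1, 1]))  # → 160.0
print(global_agg_cost(10, 1.0e9, freqs, [1, 0, 0]))        # → 5.0
```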
5.2.4 Problem Formulation
Under the previously introduced multi-model federated learning scenario, we consider how to choose participants for each of the models and how to schedule their local/global updates. Recall that we assume only one PS and κj workers are selected for each model, i.e., \sum_{i=1}^{N} x_{i,j} = 1 and \sum_{i=1}^{N} y_{i,j} = \kappa_j. We use \varrho_j \in [0.01, 0.99] as the range of the local convergence accuracy.
Constraints (5.8a) and (5.8b) make sure that the storage and CPU satisfy the re-
quirements from the FL model. Constraint (5.8c) guarantees the number of PS and
FL workers of each model is 1 and κj , respectively. Constraint (5.8d) ensures that
each server only trains one FL model and can only play one role at one time. The de-
cision variables and their ranges are given in (5.8e)-(5.8g). Note that the formulated
problem (5.8) is a non-linear mixed-integer program, which is NP-hard in general and
challenging to solve with classical computing.
Figure 5.2: Flow of the hybrid quantum-classical Benders' decomposition: the master problem (an IP) is formatted as a QUBO and solved on the QPU to obtain x_{i,j} and y_{i,j}, while the subproblem (an LP) is solved on the CPU and returns an optimality or feasibility cut to the master problem.
We reformulate our original problem (5.8) by extracting all constant variables and
further introduce additional continuous variables uj and wj to replace ϱj as below.
\min_{x,y,u,w} \sum_{j=1}^{W} \Big[ \sum_{k=1}^{N} \sum_{i=1}^{N} u_j \cdot a_{1,i,j,k} \cdot x_{k,j} \cdot y_{i,j} + \sum_{i=1}^{N} w_j \cdot a_{2,i,j} \cdot y_{i,j} + \sum_{i=1}^{N} u_j \cdot a_{3,i,j} \cdot x_{i,j} + \sum_{i=1}^{N} a_{4,i} \cdot (x_{i,j} + y_{i,j}) \Big], \quad (5.9)
where the four sets of constant variables are a_{1,i,j,k} = 2\vartheta_0 \ln(\frac{1}{\varsigma_j}) \cdot \rho_j(v_i, v_k), a_{2,i,j} = \varphi_0 \vartheta_0 \ln(\frac{1}{\varsigma_j}) \cdot \frac{\psi(D_{j,i})}{f_i}, a_{3,i,j} = \vartheta_0 \ln(\frac{1}{\varsigma_j}) \cdot \frac{\psi(\mu_j)}{f_i}, and a_{4,i} = \delta f_i. Also, u_j = \frac{1}{1-\varrho_j}, w_j = u_j \log_2(\frac{u_j}{u_j-1}), b_1 = 1.01, b_2 = 100, b_3 = 1.435, and b_4 = 6.725. Note that Problem (5.9) consists of several terms that are the products of integer and continuous variables, e.g., u_j \cdot x_{k,j} \cdot y_{i,j} and w_j \cdot y_{i,j}. Hence, we further introduce variables o_{k,i,j}, p_{i,j}, and q_{i,j} to represent each product of an integer variable and a continuous variable as below.
\min_{x,y,u,w,o,p,q} \sum_{j=1}^{W} \Big[ \sum_{k=1}^{N} \sum_{i=1}^{N} a_{1,i,j,k} \cdot o_{k,i,j} + \sum_{i=1}^{N} a_{2,i,j} \cdot p_{i,j} + \sum_{i=1}^{N} a_{3,i,j} \cdot q_{i,j} + \sum_{i=1}^{N} a_{4,i} \cdot (x_{i,j} + y_{i,j}) \Big] \quad (5.10)
s.t. (5.8a) − (5.8g), (5.9a), (5.9b),
b1 xk,j yi,j ≤ ok,i,j ≤ b2 xk,j yi,j , (5.10a)
uj − ok,i,j ≤ b2 (1 − xk,j yi,j ), (5.10b)
uj − ok,i,j ≥ b1 (1 − xk,j yi,j ), (5.10c)
b3 yi,j ≤ pi,j ≤ b4 yi,j , (5.10d)
wj − pi,j ≤ b4 (1 − yi,j ), (5.10e)
wj − pi,j ≥ b3 (1 − yi,j ), (5.10f)
b1 xi,j ≤ qi,j ≤ b2 xi,j , (5.10g)
uj − qi,j ≤ b4 (1 − xi,j ), (5.10h)
uj − qi,j ≥ b3 (1 − xi,j ). (5.10i)
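Constraints (5.10a)-(5.10c) are standard big-M bounds: when x_{k,j} y_{i,j} = 0 they force o_{k,i,j} = 0, and when x_{k,j} y_{i,j} = 1 they force o_{k,i,j} = u_j, so the product is represented exactly. A brute-force sketch checking this for a single term:

```python
B1, B2 = 1.01, 100.0   # bounds of u_j = 1/(1 - rho_j) for rho_j in [0.01, 0.99]

def feasible(o, u, x, y):
    """Constraints (5.10a)-(5.10c) for a single product term o ~ u * x * y."""
    xy = x * y
    return (B1 * xy <= o <= B2 * xy          # (5.10a)
            and u - o <= B2 * (1 - xy)       # (5.10b)
            and u - o >= B1 * (1 - xy))      # (5.10c)

# For any binary (x, y) and u in [B1, B2], the only feasible o is u * x * y.
u = 2.5
for x in (0, 1):
    for y in (0, 1):
        sols = [o for o in (0.0, 1.25, 2.5, 5.0) if feasible(o, u, x, y)]
        assert sols == [u * x * y]
print("big-M linearization pins o to u * x * y")
```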
So far, we have linearized the products of binary and continuous variables via (u, w, o, p, q), and therefore we can apply Benders' decomposition. In problem (5.10), for each possible choice x̄ and ȳ, we find the best choices of u, w, o, p, q by solving a linear program; we thus regard u, w, o, p, q as functions of x and y. We then replace the contribution of u, w, o, p, q to the objective with a scalar variable representing the value of the best choice for a given x̄ and ȳ. We start with a crude approximation of this contribution and then generate a sequence of dual solutions to tighten the approximation. In addition, problem (5.10) can be rewritten in the following general form.
\min_{X,Y} \; c^\intercal X + h^\intercal Y \quad (5.11)
s.t. A1 X = a1 , (5.11a)
A2 X ≤ a2 , (5.11b)
BX + GY ≤ a3 , (5.11c)
X = [x, y]⊺ , X ∈ X, (5.11d)
Y = [u, w, o, p, q]⊺ , Y ∈ Y. (5.11e)
where c and h are coefficients for binary and continuous variables in the objective
function, respectively. A1 , A2 , B, G are coefficients in the constraints while a1 , a2 and
a3 are constant vectors.
Next, we will detail the formulation of the corresponding subproblem (LP prob-
lems) and master problem (an integer programming (IP) problem) after the Benders’
Decomposition.
The general form of the subproblem can be further represented as follows:
\min_{Y} \; h^\intercal Y \quad (5.13)
s.t. \; -GY \ge BX - a_3, \quad (5.13a)
Y = [u, w, o, p, q]^\intercal, \; Y \in \mathbb{Y}. \quad (5.13b)
In addition, the dual problem of the subproblem is defined below, where π is the dual variable:
\max_{\pi} \; (BX - a_3)^\intercal \pi \quad (5.14)
s.t. \; -G^\intercal \pi \le h, \quad (5.14a)
\pi \ge 0. \quad (5.14b)
The master problem is then formulated as
\min_{X, \lambda} \; c^\intercal X + \lambda \quad (5.15)
s.t. \; A_1 X = a_1, \quad (5.15a)
A2 X ≤ a2 , (5.15b)
λ ≥ λdown , (5.15c)
λ ≥ (BX − a3 )⊺ π k , ∀k ∈ K̂, (5.15d)
X = [x, y]⊺ , X ∈ X. (5.15e)
where λ is the optimal value of the subproblem at the current iteration. Constraint (5.15c) is the feasible lower bound of the subproblem and (5.15d) is the corresponding Benders' cut, where K̂ is the stored index set of feasibility cuts from previous iterations.
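The loop just described can be illustrated on a toy problem (not our formulation (5.10)): the master problem keeps the binary variable and the scalar λ under the accumulated cuts, while the subproblem is an LP whose dual multiplier generates the next cut. A minimal self-contained sketch, with the master solved by brute force as a stand-in for the annealer:

```python
def benders_toy():
    """Benders' decomposition on a toy MILP: min 3x + 2y with x in {0,1},
    y >= 0, and y >= 4 - 5x.  The master problem keeps x and a scalar
    lambda bounded below by the accumulated cuts; the LP subproblem in y
    is solved in closed form, and its dual multiplier pi yields each cut."""
    cuts = []                                  # dual multipliers pi, one per cut
    ub, lb, best_x = float("inf"), -float("inf"), None
    while ub - lb > 1e-9:
        # Master: brute-force over x (a stand-in for the QUBO/annealer step);
        # lambda must satisfy every cut lambda >= pi * (4 - 5x), and >= 0.
        lb, x = min((3 * x + max([0.0] + [pi * (4 - 5 * x) for pi in cuts]), x)
                    for x in (0, 1))
        # Subproblem for fixed x: min 2y s.t. y >= 4 - 5x, y >= 0.
        y = max(0.0, 4 - 5 * x)
        pi = 2.0 if y > 0 else 0.0             # optimal dual of the subproblem
        ub, best_x = min((ub, best_x), (3 * x + 2 * y, x))
        cuts.append(pi)                        # add the Benders' cut
    return best_x, ub

print(benders_toy())  # → (1, 3.0): x = 1 is optimal with total cost 3
```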
QUBO Formulation. Quantum annealers are able to solve optimization problems in a QUBO formulation. To leverage the state-of-the-art quantum annealers
provided by D-Wave, the master problem has to be converted to the corresponding
QUBO formulation. Due to the rule of QUBO setup, we have to reformulate our
constrained master problem as the unconstrained QUBO by using penalties. The
basic idea is to find the best penalty coefficients of the constraints. Following the
principle of constraint-penalty pairs in [23], the constraints are converted as follows.
(5.15a) \Rightarrow \xi_1 : P^1 (A_1 X - a_1)^2,
(5.15b) \Rightarrow \xi_2 : P^2 \Big(A_2 X - a_2 + \sum_{l=0}^{\bar{l}_2} 2^l s_{2,l}\Big)^2.
Here, P^* is the predefined penalty vector applied when the corresponding constraint is violated, s_* is a binary slack variable, and \bar{l}_* is the upper bound on the number of slack bits. Constraints (5.15c) and (5.15d) are converted into penalties \xi_3 and \xi_4 in the same manner. The unconstrained master problem then becomes
\min_{X} \; c^\intercal X + \lambda + \xi_1 + \xi_2 + \xi_3 + \xi_4. \quad (5.16)
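The same penalty recipe can be traced on a tiny example: the constraint is squared, scaled by a penalty P, and folded into the QUBO coefficients (using x² = x for binary variables), with brute force standing in for the annealer:

```python
from itertools import product

def qubo_energy(Q, x):
    """Energy x^T Q x of a binary assignment; diagonal entries act as linear terms."""
    n = len(x)
    return sum(Q.get((i, j), 0.0) * x[i] * x[j] for i in range(n) for j in range(n))

# Toy constrained problem: min x0 + 2*x1  s.t.  x0 + x1 = 1.
# Penalizing the constraint gives x0 + 2*x1 + P*(x0 + x1 - 1)^2; expanding
# with x^2 = x for binary variables (and dropping the constant offset P):
P = 10.0
Q = {(0, 0): 1.0 - P,      # linear: 1 - P on x0
     (1, 1): 2.0 - P,      # linear: 2 - P on x1
     (0, 1): 2.0 * P}      # quadratic: 2P on x0*x1
best = min(product((0, 1), repeat=2), key=lambda x: qubo_energy(Q, x))
print(best)  # → (1, 0): the constraint holds and the cheaper variable is chosen
```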
Variable Representation. Now consider problem (5.16): it is still not a QUBO formulation due to the existence of the continuous variable λ. Thus, we need to represent λ using binary bits. We use a binary vector w of length M bits to replace the continuous variable λ and denote the resulting discrete number as \hat{\lambda} \in \mathbb{Q}. Then we can recover \hat{\lambda} by
\lambda = \sum_{ii=-m}^{\bar{m}^+} 2^{ii}\, w_{ii+m} - \sum_{jj=0}^{\bar{m}^-} 2^{jj}\, w_{jj+1+m+\bar{m}^+} = \hat{\lambda}(w). \quad (5.17)
In (5.17), m̄+ + 1 is the number of bits for the positive integer part Z+ , m is the
number of bits for the positive decimal part and m̄− + 1 is the number of bits for the
Figure 5.3: Flow of HQCBD with a single cut and multi cuts: (a) the master problem is solved by the quantum computer and the subproblem by a classical computer, exchanging one optimality/feasibility cut per iteration; (b) the master problem is solved by the quantum computer while σ subproblems are solved by classical computers in parallel, returning σ optimality/feasibility cuts per iteration.
negative integer part Z− . Then, the final QUBO formulation of the master problem
is defined as follows.
Algorithm 7 Hybrid Quantum-Classical Benders’ Decomposition (HQCBD)
Input: Distributed network with N servers V , W FL models M , Coefficient of the
objective function and constraints in master problem and subproblem
Output: All decision variables X and Y
1: Initialize the upper/lower bounds of λ: λ̄ = +∞, λ = −∞
2: Initialize threshold ϵ = 0.001, max_itr = 100, itr = 1
3: while |λ̄ − λ| > ϵ and itr < max_itr do
4: P ← appropriate penalty numbers or arrays
5: Q ← reformulate the objective and constraints in (5.10) and construct the QUBO formulation as (5.18)
6: X′ ← solve problem (5.18) on the quantum computer
7: λ ← extract w and recover λ̂(w) as in (5.17)
8: SUP(X) ← solve problem (5.14) with fixed X′
9: λ̄ ← SUP(X)
10: Add a Benders' cut to the master problem as in (5.15d)
11: itr += 1
12: end while
13: return X, Y
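Line 7 of Algorithm 7 recovers λ from the returned bit assignment via (5.17). A minimal decoder sketch with hypothetical bit widths:

```python
def decode_lambda(w, m, m_pos, m_neg):
    """Recover lambda_hat from the bit vector w per Eq. (5.17): m fractional
    bits and m_pos + 1 integer bits form the positive part; the remaining
    m_neg + 1 bits encode a subtracted non-negative integer."""
    pos = sum(2.0 ** ii * w[ii + m] for ii in range(-m, m_pos + 1))
    neg = sum(2.0 ** jj * w[jj + 1 + m + m_pos] for jj in range(m_neg + 1))
    return pos - neg

# 7 bits with m = 2, m_pos = 2, m_neg = 1: powers 1/4, 1/2, 1, 2, 4, then -1, -2.
print(decode_lambda([1, 1, 0, 1, 0, 0, 0], 2, 2, 1))  # → 2.75 (= 1/4 + 1/2 + 2)
print(decode_lambda([1, 0, 0, 0, 0, 1, 0], 2, 2, 1))  # → -0.75 (= 1/4 - 1)
```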
We leverage the D-Wave solver to implement our proposed algorithm to solve the
QUBO master problem. In addition, the penalties also need to be carefully tuned for
a decent QUBO model. In general, a large penalty can cause the quantum annealer
to malfunction due to coefficient explosion. In contrast, a small penalty can make the
quantum annealer ignore the constraints. A well-tuned penalty will lead to a fairly
high probability of the quantum solver giving the correct answer.
Algorithm 8 Multiple-cuts Benders’ Decomposition (MBD)
Input: Distributed network with N servers V , W FL models M , Coefficient of the
objective function and constraints in master problem and subproblem, number of
cuts σ
Output: All decision variables X and Y
1: Initialize the upper/lower bounds of λ: λ̄ = +∞, λ = −∞
2: Initialize threshold ϵ = 0.001, max_itr = 100, itr = 1
3: while |λ̄ − λ| > ϵ and itr < max_itr do
4: P ← appropriate penalty numbers or arrays
5: Q ← reformulate the objective and constraints in (5.10) and construct the QUBO formulation as (5.18)
6: {X′}σ ← solve problem (5.18) on the quantum computer and return σ feasible solutions
7: λ ← extract the w with the highest value and recover λ̂(w) as in (5.17)
8: {SUP(X)}σ ← solve the σ subproblems (5.14) with fixed X′ in parallel
9: λ̄ ← the lowest value among {SUP(X)}σ
10: Add all σ Benders' cuts to the master problem as in (5.15d)
11: itr += 1
12: end while
13: return X, Y
reaches the threshold, the iteration stops since the upper and lower bounds have converged to within the predefined threshold.
5.4 Evaluation
In this section, we simulate a distributed network environment and conduct experiments on realistic FL tasks using publicly available datasets. To validate the feasibility of our hybrid quantum-classical optimization algorithms, we run the proposed algorithms on a hybrid D-Wave quantum processing unit (QPU). We accessed the D-Wave system provided by the Leap quantum cloud service [98]. Based on the Pegasus topology, the D-Wave system has over 5k qubits and 35k couplers and can solve complex problems of up to 1M variables and 100k constraints. Due to the high cost of QPU utilization and time constraints, we performed a number of test cases that can be resolved in under 100 iterations.
dataset and (ii) Logistic Regression with the cross-entropy loss on MNIST. We are
also interested in the performance of our proposed methods on FL models with non-
convex loss functions. Thus, three datasets, MNIST, FMNIST, and CIFAR-10, are
used to train convolutional neural network (CNN) models with different structures.
Benchmarks and Metrics: We compare our proposed HQCBD and MBD al-
gorithms with three baseline strategies: classical Benders’ decomposition (CBD), ran-
dom algorithm (RAND), and two-stage iterative optimization algorithm (TWSO)[117].
CBD uses a classical LP solver (Gurobi [27] or SciPy [105]) to solve the master problem and subproblems. RAND randomly generates decisions on each model's parameter server, FL workers, and local convergence rate under the given constraints.
TWSO is a previous algorithm [117] that decomposes the original problem into two
subproblems (participant selection and learning scheduling) and solves them itera-
tively. The following metrics are adopted to compare the performances of our pro-
posed methods and the baselines: the total cost of FL training, the loss or accuracy
of FL models, the number of iterations, the solver accessing time and the gain or
advancement of our proposed algorithms over CBD.
Figure 5.4: Performance of HQCBD: its convergence. ((a)-(c) Upper and lower bounds of λ vs. rounds for Cases 1-3; (d) master problem value of HQCBD vs. CBD.)
Table 5.1: Iteration comparison of CBD and HQCBD over three different cases.
Case Set up # of Variables Itr. of CBD Itr. of HQCBD
1 {7, 1, 3} 63 32 31
2 {7, 2, 2} 126 55 45
3 {9, 2, 3} 198 91 89
This result shows that our proposed algorithm is mathematically consistent with the classical Benders' decomposition algorithm. In addition, Fig. 5.4(d) shows the trend of the master problem value for Case 2, calculated by (5.16), compared with the solution of CBD. We can see that the value of the master problem keeps increasing until it converges. Specifically, the master problem value remains static in the first few rounds since only an unbounded ray is found in the subproblem and a feasibility cut is added to the master problem. As we run more iterations, optimality cuts are found and added to the master problem. Once the difference between the upper and lower bounds reaches the threshold, the problem is solved. The solution from HQCBD is similar to the one from CBD.
Figure 5.5: Comparison of the real solver accessing time and gains of MBD over CBD in different cases. ((a) Solver accessing time (ms) per round, local CBD vs. QPU HQCBD; (b) gain/advancement (%) of MBD over CBD vs. the number of cuts for Cases 1-3.)
Table 5.2: Solver accessing time (ms) comparison of CBD and HQCBD.
Case | CBD Max / Min  | CBD Avg / Std  | HQCBD Max / Min | HQCBD Avg / Std
1    | 190.47 / 6.71  | 117.14 / 50.12 | 32.10 / 15.93   | 31.49 / 2.79
2    | 235.29 / 9.11  | 129.56 / 50.04 | 32.11 / 15.92   | 24.18 / 7.98
3    | 395.48 / 14.45 | 120.25 / 63.19 | 32.11 / 16.01   | 25.53 / 7.85
Table 5.1 further details the comparison between CBD and HQCBD in terms of the
number of iterations used to solve the problem. We find that HQCBD takes fewer
iterations to converge to the optimal solution than CBD (for example, for Case 2,
the reduction in iterations is around 18%).
Moreover, we compare the real solver accessing time (i.e., the computation time of
the solvers) for CBD and HQCBD in Table 5.2 and plot the detailed accessing time
of Case 2 in Fig. 5.5. The solver accessing time is the real accessing time of the
QPU solver and the local solver, excluding other overheads such as variable setup
time and parameter transmission time. As shown in Table 5.2, the minimum accessing
time of CBD is lower than that of HQCBD. However, the maximum and average
accessing times, as well as the standard deviation, of CBD are significantly higher
than those of HQCBD. For example, for Case 2, the mean accessing time of HQCBD
is 81% less than that of CBD, and, more significantly, the standard deviation of
HQCBD's accessing time is 84% less than that of CBD. We also confirm via
Fig. 5.5(a) that the solver accessing time of CBD varies significantly from round to
round, while the solver accessing time of HQCBD
[Figure: upper and lower bound values versus rounds for CBD and MBD-1/3/5 — (c) Case 3, (d) convergence comparison.]
Figure 5.6: Performance of MBD: its convergence.
in each round remains stable and is even smaller than that of CBD. This finding
demonstrates the efficiency and robustness of leveraging the hybrid quantum-classical
technique to solve the optimization problem, in terms of both convergence iterations
and solver accessing time.
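The Case 2 percentages quoted above follow directly from the Avg/Std entries of Table 5.2, as this quick check confirms:

```python
# Percent reduction in mean and standard deviation of solver accessing time
# for Case 2, using the values (in ms) reported in Table 5.2.
cbd_avg, cbd_std = 129.56, 50.04      # CBD:   Avg / Std
hqcbd_avg, hqcbd_std = 24.18, 7.98    # HQCBD: Avg / Std

avg_reduction = (cbd_avg - hqcbd_avg) / cbd_avg * 100
std_reduction = (cbd_std - hqcbd_std) / cbd_std * 100
print(round(avg_reduction), round(std_reduction))  # -> 81 84
```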
Performance of MBD
We now evaluate the efficiency of our proposed MBD algorithm. Similarly, we con-
sider three different cases with different numbers of servers, FL models, and workers.
We study the impact of the number of cuts σ used in MBD, selecting its value
from {1, 3, 5}. Recall that when σ = 1, MBD reduces to our standard HQCBD. Table 5.3
[Figure: total costs of RAND, TWSO, and HQCBD — (a) impact of the number of servers (7–11), (b) impact of the number of workers (2–6).]
Figure 5.7: Performance comparison with existing methods.
and Fig. 5.6 show the results of multiple cuts and the convergence comparison with CBD.
In Fig. 5.6(a)-(c), MBD-1 is our proposed HQCBD algorithm where only a single cut
is added to the master problem, while MBD-3 and MBD-5 mean 3 or 5 cuts are added
to the master problem. In this scenario, we find that MBD-1 (HQCBD)
converges faster than CBD, and with more cuts (larger σ) the convergence
of MBD-σ becomes even faster. Table 5.3 lists the detailed comparison between CBD and
MBD for the different cases. Fig. 5.6(d) further shows the detailed upper- and lower-bound
convergence comparison between our proposed algorithm MBD with σ = 5
and CBD in Case 2. Our proposed method uses fewer rounds
(29) to converge to the optimal value than the classical one (55).
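The mechanism behind MBD-σ can be sketched as follows. The annealer returns many low-energy samples per call, so instead of cutting only at the single best master solution, the σ best distinct samples each contribute a cut; `samples` and `make_cut` below are illustrative stand-ins for the annealer's sampleset and the dual-based cut construction, not the dissertation's implementation:

```python
# Build sigma cuts per round from an annealer-style sampleset: a list of
# (solution, energy) pairs, where the lowest energy is the best candidate.

def multiple_cuts(samples, make_cut, sigma=3):
    """Take the sigma best distinct samples and derive one cut from each."""
    seen, cuts = set(), []
    for solution, energy in sorted(samples, key=lambda s: s[1]):
        key = tuple(solution)
        if key in seen:          # skip duplicate solutions
            continue
        seen.add(key)
        cuts.append(make_cut(solution))
        if len(cuts) == sigma:
            break
    return cuts

# Toy sampleset: duplicates are skipped; the 3 best distinct solutions
# each produce one cut (here the "cut" is just a placeholder value).
samples = [((0, 1), -5.0), ((0, 1), -5.0), ((1, 1), -4.0), ((0, 0), -1.0)]
print(multiple_cuts(samples, make_cut=lambda sol: sum(sol)))  # -> [1, 2, 0]
```

With sigma=1 this degenerates to the single-cut HQCBD behavior, matching the MBD-1 curves in Fig. 5.6.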
We also plot the gain or advancement of MBD over CBD in terms of iteration
reduction for different numbers of cuts in Fig. 5.5(b). Different numbers of cuts
achieve different positive gains across the cases; the largest improvement is up to
70.3% for Case 3 with σ = 5. This further proves the efficiency of both of our
proposed algorithms, HQCBD and MBD.
We now compare our proposed method HQCBD with the random method (RAND)
and a two-stage iterative optimization method (TWSO) [117] in terms of solving the
joint optimization problem.
First, we focus on the necessity of the optimization problem and study the
impact of the number of servers. We concurrently train 2 FL models with 2
workers per model, while the number of servers varies from 7 to 11. Fig. 5.7(a) shows
the results. Obviously, RAND has the worst performance due to its randomness. Our
HQCBD algorithm further improves on our proposed TWSO, demonstrating the
effectiveness of HQCBD. In addition, as the number of servers increases, the total
cost of HQCBD first decreases, then increases, and then decreases again. This is
because the topology may change as the number of servers varies, which changes
the selection decisions as well as the total cost.
Next, we investigate the impact of different numbers of FL workers on total costs.
We set the number of servers and FL models to 15 and 2, respectively. The number
of FL workers is in the range of [2, 6]. As shown in Fig. 5.7(b), the total costs increase
as the number of workers increases. This is expected, since more workers consume
more resources and thus incur higher total costs. Our proposed HQCBD still
outperforms the RAND and TWSO algorithms. With more qubits available, we expect
the speed advantage of HQCBD over TWSO to be even more significant on
large-scale optimization problems.
adaptive control in edge computing without client selection. They proposed a con-
trol algorithm to determine the trade-off between local update and global parameter
aggregation so as to minimize the loss function. Both [56] and [58] considered client-
edge-cloud hierarchical federated learning (HFL), where the cloud and edge servers
work as two-tier parameter servers to aggregate partial models from mobile clients (i.e.,
FL workers). Liu et al. [56] proved the convergence of such an HFL, while Luo et
al. [58] studied a joint resource allocation and edge association problem for device
users under such an HFL framework to achieve global cost minimization. Wang et al.
[113] also considered cluster structure formation in HFL, where edge servers are
clustered for model aggregation. Meng et al. [66] focused on model training using
decentralized P2P methods in edge computing. While some of these works also con-
sider learning control of FL, they either consider different FL topologies (e.g., HFL,
DFL) or optimize different objectives.
prove the training performance and indicated that clients with the greatest utility
can improve model accuracy and hasten the convergence speed. Furthermore, Jin
et al. [35] considered both the learning control of FL and the edge provisioning problem
in distributed cloud-edge networks. While their work is similar to ours, they did not
consider the parameter server selection problem, and the remote cloud center always
plays the role of the PS in their scenario. In addition, none of the aforementioned
works takes the concurrent training of multiple FL models into account, which
significantly affects the total training performance of all FL models.
to tackle a specific real-world optimization problem that jointly optimizes the client
selection and learning schedule in multi-model FL. For this particular problem, we
propose a distinct solving process in which the binary master problem is solved by a
quantum annealer while the subproblems with continuous variables are addressed by
a classical computer. Different from [142, 21], we also adopt a multiple-cuts
strategy to hasten the convergence speed.
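To give a flavor of why the binary master problem fits a quantum annealer, recall that annealers accept QUBO inputs: a linear equality constraint such as "select exactly k variables" can be folded into the objective as a squared penalty. The three-variable objective and penalty weight below are illustrative only, not the dissertation's actual master problem:

```python
# QUBO for  min sum_i c_i x_i + P * (sum_i x_i - k)^2  over x_i in {0, 1}.
# Since x_i^2 = x_i for binaries, the penalty expands to P*(1 - 2k) on the
# diagonal and 2P on every off-diagonal pair (the constant P*k^2 is dropped,
# as it does not change the argmin).

def to_qubo(linear, k, penalty):
    n = len(linear)
    Q = {}
    for i in range(n):
        Q[(i, i)] = linear[i] + penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            Q[(i, j)] = 2 * penalty
    return Q

def energy(Q, x):
    return sum(coef * x[i] * x[j] for (i, j), coef in Q.items())

# Brute-force the 3-variable case "pick exactly k = 1 variable, min cost":
Q = to_qubo(linear=[3.0, 1.0, 2.0], k=1, penalty=10.0)
best = min(((a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)),
           key=lambda x: energy(Q, x))
print(best)  # -> (0, 1, 0): the single cheapest variable is selected
```

On real hardware the brute-force `min` is replaced by sampling the annealer; the penalty weight must dominate the cost coefficients so that constraint-violating solutions never win.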
CHAPTER 6
DISSERTATION CONCLUSION
in our cases, we also leveraged reinforcement learning (RL) techniques to tackle our
joint optimization problem.
Last but not least, we considered a multi-model federated edge learning
where multiple FEL models are being trained in the edge network and edge servers
can act as either parameter servers or workers of these FEL models. We formu-
lated a joint participant selection and learning scheduling problem, which
is a non-linear mixed-integer program, aiming to minimize the total cost of all FEL
models while satisfying the desired convergence rate of trained FEL models and the
constrained edge resources. We then designed several algorithms by decoupling the
original problem into two or three sub-problems that can be solved respectively and
iteratively. We also extended our work to other training topologies (e.g., DFL, HFL)
and proposed several heuristic algorithms to solve the optimization problems. We
further proposed a novel Hybrid Quantum-Classical Benders’ Decomposition
(HQCBD) algorithm to tackle the joint participant selection and learning schedule
problem. By combining quantum computing and classical optimization techniques,
our HQCBD algorithm can quickly converge to the desired solution, just like the
classical BD algorithm, but with far fewer iterations and at much faster speeds. We
also presented a multiple-cuts version of HQCBD (MBD) to accelerate con-
vergence by forming multiple cuts in each round using multiple outputs from the
quantum annealer. MBD can achieve varying levels of performance improvement by
selecting different numbers of cuts.
lems. The successful application of FL in supervised learning tasks arouses interest
in exploiting similar ideas in RL, i.e., FRL. FRL not only provides the experience
for agents to learn to make good decisions in an unknown environment but also en-
sures that the privately collected data during the agent’s exploration does not have
to be shared with others. However, most current works on FRL focus on horizontal
federated reinforcement learning (HFRL), in which the agents may be distributed
geographically, but they face similar decision-making tasks and have very little
interaction with each other in the observed environments. Hence, I am interested in vertical
federated reinforcement learning (VFRL), which applies the methodology of VFL to
RL and is more realistic for real-world scenarios. In vertical federated learning
(VFL), samples of multiple data sets have different feature spaces but these samples
may belong to the same groups or common users. The training data of each partic-
ipant are divided vertically according to their features. More general and accurate
models can be generated by building heterogeneous feature spaces without releasing
private information. VFRL is suitable for Partial Observation Markov Decision Pro-
cess (POMDP) scenarios where different RL agents are in the same environment but
have different interactions with the environment. Compared with HFRL, there are
currently few works on VFRL. The drawback of current VFRL works is the small
feature space of states and limited training data. In addition, they only contain two
agents, and the structure of the aggregated neural network model is relatively simple.
Hence, a promising first step is to implement a more flexible and general VFRL
framework and verify its effectiveness.
Hybrid Quantum-Classical Techniques for Satellite Edge Intelligence.
With the acceleration of the beyond-6G wireless communication process, satellite
communication technologies and high-altitude platform (HAP) or unmanned aerial vehicle
(UAV) communication technologies have attracted wide attention for their reduced
vulnerability to natural disasters and physical attacks. As a technology that has
been proven and deployed for a long time, satellite communication stands out for its
capacious service coverage capabilities. Recently, the integration of satellite networks,
terrestrial cellular networks, and mobile edge computing has become a general trend
for future networks. However, several challenges remain in combining edge computing
and satellites: 1) limited visibility time of satellites; 2) terrestrial edge and
cloud infrastructures are generally fixed, but satellites are moving assets; 3) task
assignments and satellite edge state have to migrate across multiple neighboring satellites
when they move beyond the coverage; 4) satellite resources need to be shared among multiple tasks
rather than a specific edge computing task. Therefore, I am interested in developing
heuristic approaches by leveraging hybrid quantum-classical techniques to implement
resource allocation across satellite-terrestrial networks, server assignment for task ex-
ecution, load-aware offloading process, as well as satellite network instability detection
due to the continuous route changes or task assignments. In addition, to improve
the efficiency of hybrid quantum-classical techniques, I am also interested in the
optimization and improvement of quantum computing itself.
Resource Management and Scheduling Optimization in Quantum Net-
works. Quantum networks use the quantum properties of photons to encode infor-
mation. For instance, photons polarized in one direction (for example, in the direction
that would allow them to pass through polarized sunglasses) are associated with the
value one; photons polarized in the opposite direction (so they do not pass through the
sunglasses) are associated with the value zero. Researchers are developing quantum
communication protocols to formalize these associations, allowing the quantum state
of photons to carry information from sender to receiver through a quantum network.
Hence, quantum resource management problems arise, and I propose to develop
optimization algorithms to manage quantum resources and schedule quantum entanglement
in quantum networks.
Collaborative Intelligent Systems for Edge AIoT, AR/VR. Modern infor-
mation or network systems do not serve individual users in a vacuum but rather must
provide service simultaneously for a large number of users. Effective and broadly
applicable learning approaches should have both the flexibility to model and per-
sonalize to individual users, as well as the ability to intelligently balance the ex-
ploration/exploitation trade-off for entire populations of users. In addition to the
aforementioned future directions, I am also interested in developing collaborative
intelligent systems for edge AIoT, AR/VR, and vehicular ad-hoc networks. These
systems aim to provide personalized data management and privacy protection services,
as well as to integrate resource allocation, task assignment, and self-learning modules.
Such intelligent collaborative systems suit many scenarios, such as smart cities, smart
healthcare, smart grids, intelligent robots, and advanced manufacturing.
BIBLIOGRAPHY
[1] Hassan I Abdalla. An efficient approach for data placement in distributed sys-
tems. In 2011 Fifth FTRA international conference on multimedia and ubiqui-
tous engineering, pages 297–301. IEEE, 2011.
[2] Akshay Ajagekar, Kumail Al Hamoud, and Fengqi You. Hybrid classical-
quantum optimization techniques for solving mixed-integer programming prob-
lems in production scheduling. IEEE Transactions on Quantum Engineering,
3:1–16, Jun. 2022.
[3] Akshay Ajagekar, Travis Humble, and Fengqi You. Quantum computing based
hybrid solution strategies for large-scale discrete-continuous optimization prob-
lems. Computers & Chemical Engineering, 132:106630, Jan. 2020.
[4] Mohammad H Al-Shayeji, Sam Rajesh, Manal Alsarraf, and Reem Alsuwaid.
A comparative study on replica placement algorithms for content delivery net-
works. In 2010 Second International Conference on Advances in Computing,
Control, and Telecommunication Technologies, pages 140–142. IEEE, 2010.
[5] Dong An and Lin Lin. Quantum linear system solver based on time-optimal adi-
abatic quantum computing and quantum approximate optimization algorithm.
ACM Transactions on Quantum Computing, 3(2):1–28, Jun. 2022.
[6] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami
Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell,
et al. Quantum supremacy using a programmable superconducting processor.
Nature, 574(7779):505–510, Oct. 2019.
[7] Cheikh Saliou Mbacke Babou, Doudou Fall, Shigeru Kashihara, Yuzo Taenaka,
Monowar H Bhuyan, Ibrahima Niang, and Youki Kadobayashi. Hierarchical
load balancing and clustering technique for home edge computing. IEEE Access,
8:127593–127607, 2020.
[8] Ravikumar Balakrishnan, Tian Li, Tianyi Zhou, Nageen Himayat, Virginia
Smith, and Jeff Bilmes. Diverse client selection for federated learning via sub-
modular maximization. In International Conference on Learning Representa-
tions (ICLR), Virtual, Jan. 2022.
[9] Logan Beal, Daniel Hill, R Martin, and John Hedengren. Gekko optimization
suite. Processes, 6(8):106, 2018.
[10] Ran Bi, Qian Liu, Jiankang Ren, and Guozhen Tan. Utility aware offloading
for mobile-edge computing. Tsinghua Science and Technology, 26(2):239–250,
2020.
[11] Suzhi Bi, Liang Huang, and Ying-Jun Angela Zhang. Joint optimization of
service caching placement and computation offloading in mobile edge computing
systems. IEEE Transactions on Wireless Communications, 19(7):4947–4963,
2020.
[12] Martin Breitbach, Dominik Schäfer, Janick Edinger, and Christian Becker.
Context-aware data and task placement in edge computing environments. In
2019 IEEE International Conference on Pervasive Computing and Communications
(PerCom), pages 1–10. IEEE, 2019.
[13] André Brinkmann, Kay Salzwedel, and Christian Scheideler. Efficient, dis-
tributed data placement strategies for storage area networks. In Proceedings of
the twelfth annual ACM symposium on Parallel algorithms and architectures,
pages 119–128, 2000.
[14] Mingzhe Chen, Zhaohui Yang, Walid Saad, Changchuan Yin, H Vincent Poor,
and Shuguang Cui. A joint learning and communications framework for feder-
ated learning over wireless networks. IEEE Transactions on Wireless Commu-
nications, 20(1):269–283, 2020.
[15] Xianfu Chen, Honggang Zhang, Celimuge Wu, Shiwen Mao, Yusheng Ji, and
Medhi Bennis. Optimized computation offloading performance in virtual edge
computing systems via deep reinforcement learning. IEEE Internet of Things
Journal, 6(3):4005–4018, 2018.
[16] Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Client selection in federated
learning: Convergence analysis and power-of-choice selection strategies. arXiv
preprint arXiv:2010.01243, 2020.
[18] David Deutsch and Richard Jozsa. Rapid solution of problems by quantum
computation. Proceedings: Mathematical and Physical Sciences, 439(1907):553–
558, Dec. 1992.
[19] Maciej Drwal and Jerzy Józefczyk. Decentralized approximation algorithm for
data placement problem in content delivery networks. In Doctoral Conference
on Computing, Electrical and Industrial Systems, pages 85–92. Springer, 2012.
[20] Nima Eshraghi and Ben Liang. Joint offloading decision and resource allocation
with uncertain task computing requirement. In IEEE INFOCOM 2019-IEEE
Conference on Computer Communications, pages 1414–1422. IEEE, 2019.
[21] Lei Fan and Zhu Han. Hybrid quantum-classical computing for future network
optimization. IEEE Network, 36(5):72–76, Nov. 2022.
[22] Vajiheh Farhadi, Fidan Mehmeti, Ting He, Tom La Porta, Hana Khamfroush,
Shiqiang Wang, and Kevin S Chan. Service placement and request scheduling
for data-intensive applications in edge clouds. In IEEE INFOCOM 2019-IEEE
Conference on Computer Communications, pages 1279–1287. IEEE, 2019.
[23] Fred Glover, Gary Kochenberger, Rick Hennig, and Yu Du. Quantum bridge
analytics I: a tutorial on formulating and using QUBO models. 4OR-Q J Oper
Res, 17:335–371, Nov. 2019.
[24] Lov K. Grover. A fast quantum mechanical algorithm for database search.
In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of
Computing, STOC ’96, New York, NY, May 1996.
[25] Hongzhi Guo, Jiajia Liu, and Jianfeng Lv. Toward intelligent task offloading
at the edge. IEEE Network, 34(2):128–134, 2019.
[26] Wei Guo and Xinjun Wang. A data placement strategy based on genetic algo-
rithm in cloud computing platform. In 2013 10th Web Information System and
Application Conference, pages 369–372. IEEE, 2013.
[27] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, Jan. 2023.
[28] Chamseddine Hamdeni, Tarek Hamrouni, and F Ben Charrada. Data popularity
measurements in distributed systems: Survey and design directions. Journal of
Network and Computer Applications, 72:150–161, 2016.
[29] Liang Huang, Suzhi Bi, and Ying-Jun Angela Zhang. Deep reinforcement learn-
ing for online computation offloading in wireless powered mobile-edge com-
puting networks. IEEE Transactions on Mobile Computing, 19(11):2581–2593,
2019.
[30] Yaodong Huang, Xintong Song, Fan Ye, Yuanyuan Yang, and Xiaoming Li. Fair
and efficient caching algorithms and strategies for peer data sharing in perva-
sive edge computing environments. IEEE Transactions on Mobile Computing,
19(4):852–864, 2019.
[31] Yaodong Huang, Jiarui Zhang, Jun Duan, Bin Xiao, Fan Ye, and Yuanyuan
Yang. Resource allocation and consensus on edge blockchain in pervasive edge
computing environments. In 2019 IEEE 39th International Conference on Dis-
tributed Computing Systems (ICDCS), pages 1476–1486. IEEE, 2019.
[33] Shaoxiong Ji, Wenqi Jiang, Anwar Walid, and Xue Li. Dynamic sampling and
selective masking for communication-efficient federated learning. IEEE Intelligent
Systems, 37(02):27–34, Mar. 2022.
[34] Shaoxiong Ji, Wenqi Jiang, Anwar Walid, and Xue Li. Dynamic sampling and
selective masking for communication-efficient federated learning. arXiv preprint
arXiv:2003.09603, 2020.
[35] Yibo Jin, Lei Jiao, Zhuzhong Qian, Sheng Zhang, and Sanglu Lu. Learning for
learning: Predictive online control of federated learning with edge provisioning.
In IEEE INFOCOM 2021-IEEE Conference on Computer Communications,
pages 1–10. IEEE, 2021.
[36] Yibo Jin, Lei Jiao, Zhuzhong Qian, Sheng Zhang, Sanglu Lu, and Xiaoliang
Wang. Resource-efficient and convergence-preserving online participant selec-
tion in federated learning. In IEEE International Conference on Distributed
Computing Systems (ICDCS), 2020.
[37] Junghoon Kim, Taejoon Kim, Morteza Hashemi, Christopher G Brinton, and
David J Love. Joint optimization of signal design and resource allocation in
wireless D2D edge computing. In IEEE INFOCOM 2020-IEEE Conference on
Computer Communications, pages 2086–2095. IEEE, 2020.
[38] Simon Knight, Hung X Nguyen, Nickolas Falkner, Rhys Bowden, and Matthew
Roughan. The internet topology zoo. IEEE Journal on Selected Areas in Com-
munications, 29(9):1765–1775, Oct. 2011.
[39] Alex Krizhevsky. Learning multiple layers of features from tiny images. Tech-
nical report, University of Toronto, Apr. 2009.
[40] Fan Lai, Xiangfeng Zhu, Harsha V Madhyastha, and Mosharaf Chowdhury.
Oort: Efficient federated learning via guided participant selection. In Pro-
ceedings of the 15th USENIX Symposium on Operating Systems Design and
Implementation (OSDI), Virtual, Jul. 2021.
[41] Phu Lai, Qiang He, Mohamed Abdelrazek, Feifei Chen, John Hosking, John
Grundy, and Yun Yang. Optimal edge user allocation in edge computing with
variable sized vector bin packing. In International Conference on Service-
Oriented Computing, pages 230–245. Springer, 2018.
[43] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-
based learning applied to document recognition. Proceedings of the IEEE,
86(11):2278–2324, 1998.
[44] Lin-Wen Lee, Peter Scheuermann, and Radek Vingralek. File assignment in
parallel I/O systems with minimal variance of service time. IEEE Transactions
on Computers, 49(2):127–140, 2000.
[45] Chunlin Li, Jingpan Bai, and JianHang Tang. Joint optimization of data place-
ment and scheduling for improving user experience in edge computing. Journal
of Parallel and Distributed Computing, 125:93–105, 2019.
[46] Ji Li, Hui Gao, Tiejun Lv, and Yueming Lu. Deep reinforcement learning based
computation offloading and resource allocation for mec. In 2018 IEEE Wireless
Communications and Networking Conference (WCNC), pages 1–6. IEEE, 2018.
[47] Jun Li, Hao Wu, Bin Liu, Jianyuan Lu, Yi Wang, Xin Wang, YanYong Zhang,
and Lijun Dong. Popularity-driven coordinated caching in named data net-
working. In 2012 ACM/IEEE Symposium on Architectures for Networking and
Communications Systems (ANCS), pages 15–26. IEEE, 2012.
[48] Liang Li, Dian Shi, Ronghui Hou, Hui Li, Miao Pan, and Zhu Han. To talk
or to work: Flexible communication compression for energy efficient federated
learning over heterogeneous mobile edge devices. In IEEE INFOCOM 2021-
IEEE Conference on Computer Communications, pages 1–10. IEEE, 2021.
[49] Qiang Li, Kun Wang, Suwei Wei, Xuefeng Han, Lili Xu, and Min Gao. A
data placement strategy based on clustering and consistent hashing algorithm
in cloud computing. In 9th International Conference on Communications and
Networking in China, pages 478–483. IEEE, 2014.
[50] Ting Li, Zhijin Qiu, Lijuan Cao, Dazhao Cheng, Weichao Wang, Xinghua Shi,
and Yu Wang. Privacy-preserving participant grouping for mobile social sensing
over edge clouds. IEEE Transactions on Network Science and Engineering,
8(2):865–880, 2020.
[51] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On
the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189,
2019.
[52] Youqi Li, Fan Li, Lixing Chen, Liehuang Zhu, Pan Zhou, and Yu Wang. Power
of redundancy: Surplus client scheduling for federated learning against user
uncertainties. IEEE Transactions on Mobile Computing, 2022.
[53] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom
Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with
deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[54] Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-
Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning
in mobile edge networks: A comprehensive survey. IEEE Communications
Surveys & Tutorials, 22(3):2031–2063, 2020.
[55] Bing Lin, Fangning Zhu, Jianshan Zhang, Jiaqing Chen, Xing Chen, Naixue N
Xiong, and Jaime Lloret Mauri. A time-driven data placement strategy for
a scientific workflow combining edge computing and cloud computing. IEEE
Transactions on Industrial Informatics, 15(7):4254–4265, 2019.
[56] Lumin Liu, Jun Zhang, SH Song, and Khaled B Letaief. Client-edge-cloud hier-
archical federated learning. In ICC 2020-2020 IEEE International Conference
on Communications (ICC), pages 1–6. IEEE, 2020.
[57] Yang Liu, Tong Feng, Mugen Peng, Jianfeng Guan, and Yu Wang. Dream:
Online control mechanisms for data aggregation error minimization in privacy-
preserving crowdsensing. IEEE Transactions on dependable and secure comput-
ing, 19(2):1266–1279, 2020.
[58] Siqi Luo, Xu Chen, Qiong Wu, Zhi Zhou, and Shuai Yu. HFEL: Joint edge
association and resource allocation for cost-efficient hierarchical federated edge
learning. IEEE Transactions on Wireless Communications, 19(10):6535–6548,
2020.
[59] Qin Lv, Pei Cao, Edith Cohen, Kai Li, and Scott Shenker. Search and replica-
tion in unstructured peer-to-peer networks. In Proceedings of the 16th interna-
tional conference on Supercomputing, pages 84–95, 2002.
[60] Chenxin Ma, Jakub Konečnỳ, Martin Jaggi, Virginia Smith, Michael I Jordan,
Peter Richtárik, and Martin Takáč. Distributed optimization with arbitrary
local solvers. Optimization Methods and Software, 32(4):813–848, 2017.
[61] Xiao Ma, Ao Zhou, Shan Zhang, and Shangguang Wang. Cooperative service
caching and workload scheduling in mobile edge computing. In IEEE INFO-
COM 2020-IEEE Conference on Computer Communications, pages 2076–2085.
IEEE, 2020.
[63] Ouiame Marnissi, Hajar El Hammouti, and El Houcine Bergou. Client se-
lection in federated learning based on gradients importance. arXiv preprint
arXiv:2111.11204, Nov. 2021.
[64] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and
Blaise Aguera y Arcas. Communication-efficient learning of deep networks from
decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282.
PMLR, 2017.
[65] Qianyu Meng, Kun Wang, Xiaoming He, and Minyi Guo. Qoe-driven big data
management in pervasive edge computing environment. Big Data Mining and
Analytics, 1(3):222–233, 2018.
[66] Zeyu Meng, Hongli Xu, Min Chen, Yang Xu, Yangming Zhao, and Chunming
Qiao. Learning-driven decentralized machine learning in resource-constrained
wireless edge computing. In IEEE INFOCOM 2021-IEEE Conference on Com-
puter Communications, pages 1–10. IEEE, 2021.
[67] Erfan Meskar and Ben Liang. Fair multi-resource allocation in mobile edge
computing with multiple access points. In Proceedings of the Twenty-First
International Symposium on Theory, Algorithmic Foundations, and Protocol
Design for Mobile Networks and Mobile Computing, pages 11–20, 2020.
[68] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim-
othy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asyn-
chronous methods for deep reinforcement learning. In International conference
on machine learning, pages 1928–1937, 2016.
[69] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Ve-
ness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland,
Georg Ostrovski, et al. Human-level control through deep reinforcement learning.
Nature, 518(7540):529–533, 2015.
[70] Mongnam Han, Youngseok Lee, Sue B. Moon, Keon Jang, and Dooyoung Lee.
CRAWDAD dataset kaist/wibro (v. 2008-06-04), 2020.
https://crawdad.org/kaist/wibro/20080604.
[71] Nuno Moniz and Luís Torgo. Multi-source social feedback of online news
feeds. CoRR, https://arxiv.org/abs/1801.07055, 2018.
[72] Samrat Nath and Jingxian Wu. Deep reinforcement learning for dynamic com-
putation offloading and resource allocation in cache-assisted mobile edge com-
puting systems. Intelligent and Converged Networks, 1(2):181–198, 2020.
[74] Minh NH Nguyen, Nguyen H Tran, Yan Kyaw Tun, Zhu Han, and Choong Seon
Hong. Toward multiple federated learning services resource sharing in mobile
edge networks. arXiv preprint arXiv:2011.12469, 2020.
[75] Zhaolong Ning, Peiran Dong, Xiaojie Wang, Joel JPC Rodrigues, and Feng
Xia. Deep reinforcement learning for vehicular edge computing: An intelligent
offloading system. ACM Transactions on Intelligent Systems and Technology
(TIST), 10(6):1–24, 2019.
[76] Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with
heterogeneous resources in mobile edge. In ICC 2019-2019 IEEE International
Conference on Communications (ICC), pages 1–7. IEEE, 2019.
[77] Siyuan Niu and Aida Todri-Sanial. Effects of dynamical decoupling and pulse-
level optimizations on IBM quantum computers. IEEE Transactions on Quan-
tum Engineering, 3:1–10, Aug. 2022.
[78] Tao Ouyang, Rui Li, Xu Chen, Zhi Zhou, and Xin Tang. Adaptive user-managed
service placement for mobile edge computing: An online learning approach. In
IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pages
1468–1476. IEEE, 2019.
[79] Stephen Pasteris, Shiqiang Wang, Mark Herbster, and Ting He. Service place-
ment with provable guarantees in heterogeneous edge computing systems. In
IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pages
514–522. IEEE, 2019.
[82] Konstantinos Poularakis, Jaime Llorca, Antonia M Tulino, Ian Taylor, and
Leandros Tassiulas. Joint service placement and request routing in multi-cell
mobile edge computing networks. In IEEE INFOCOM 2019-IEEE Conference
on Computer Communications, pages 10–18. IEEE, 2019.
[83] John Preskill. Quantum computing in the NISQ era and beyond. Quantum,
2:79, Aug. 2018.
[84] G. M. Shafiqur Rahman, Tian Dang, and Manzoor Ahmed. Deep reinforcement
learning based computation offloading and resource allocation for low-latency
fog radio access networks. Intelligent and Converged Networks, 1(3):243–257,
2020.
[85] Ragheb Rahmaniani, Teodor Gabriel Crainic, Michel Gendreau, and Walter
Rei. The benders decomposition algorithm: A literature review. European
Journal of Operational Research, 259(3):801–817, Jun 2017.
[86] Google Research. Google cluster data (clusterdata 2011 traces), 2011.
https://github.com/google/cluster-data.
[88] Krzysztof Rzadca, Anwitaman Datta, and Sonja Buchegger. Replica placement
in P2P storage: Complexity and game theoretic analyses. In 2010 IEEE 30th
International Conference on Distributed Computing Systems, pages 599–609.
IEEE, 2010.
[89] Özlem Salehi, Adam Glos, and Jaroslaw Adam Miszczak. Unconstrained binary
models of the travelling salesman problem variants for quantum optimization.
Quantum Information Processing, 21(2):67, Jan. 2022.
[90] Gamal Sallam and Bo Ji. Joint placement and allocation of virtual network
functions with budget and capacity constraints. In IEEE INFOCOM 2019-
IEEE Conference on Computer Communications, pages 523–531. IEEE, 2019.
[91] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek.
Robust and communication-efficient federated learning from non-IID data. IEEE
Transactions on Neural Networks and Learning Systems, 31(9):3400–3413, 2019.
[92] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient dis-
tributed optimization using an approximate Newton-type method. In Inter-
national conference on machine learning (ICML), Beijing, China, Jun. 2014.
[93] Yanling Shao, Chunlin Li, and Hengliang Tang. A data replica placement strat-
egy for IoT workflows in collaborative edge and cloud environments. Computer
Networks, 148:46–59, 2019.
[94] Dian Shen, Junzhou Luo, Fang Dong, and Junxue Zhang. VirtCo: joint coflow
scheduling and virtual machine placement in cloud data centers. Tsinghua
Science and Technology, 24(5):630–644, 2019.
[95] Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. Edge com-
puting: Vision and challenges. IEEE Internet of Things Journal, 3(5):637–646,
2016.
[96] Peter W. Shor. Polynomial-time algorithms for prime factorization and discrete
logarithms on a quantum computer. SIAM J. Comput., 26(5):1484–1509, Oct.
1997.
[97] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and
Martin Riedmiller. Deterministic policy gradient algorithms. In International
Conference on Machine Learning (ICML), 2014.
[100] Jing Tian, Zhi Yang, and Yafei Dai. A data placement scheme with time-
related model for P2P storages. In Seventh IEEE International Conference on
Peer-to-Peer Computing (P2P 2007), pages 151–158. IEEE, 2007.
[101] Nguyen H Tran, Wei Bao, Albert Zomaya, Minh NH Nguyen, and Choong Seon
Hong. Federated learning over wireless networks: Optimization model design
and analysis. In IEEE INFOCOM 2019-IEEE Conference on Computer Com-
munications, pages 1387–1395. IEEE, 2019.
[102] Tony Tran, Minh Do, Eleanor Rieffel, Jeremy Frank, Zhihui Wang, Bryan
O’Gorman, Davide Venturelli, and J Beck. A hybrid quantum-classical approach
to solving scheduling problems. In Proceedings of the International Symposium
on Combinatorial Search, New York, USA, Jul. 2016.
[103] Manghui Tu, Hui Ma, Liangliang Xiao, I-Ling Yen, Farokh Bastani, and Di-
anxiang Xu. Data placement in P2P data grids considering the availability,
security, access performance and load balancing. Journal of Grid Computing,
11(1):103–127, 2013.
[104] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning
with double Q-learning. In Proceedings of the AAAI Conference on Artificial
Intelligence, 2016.
[105] Pauli Virtanen, Ralf Gommers, et al. SciPy 1.0: Fundamental Algorithms for
Scientific Computing in Python. Nature Methods, 17:261–272, Feb. 2020.
[106] Jiadai Wang, Lei Zhao, Jiajia Liu, and Nei Kato. Smart resource allocation
for mobile edge computing: A deep reinforcement learning approach. IEEE
Transactions on Emerging Topics in Computing, 2019.
[107] Mingjun Wang, Jinghui Zhang, Fang Dong, and Junzhou Luo. Data placement
and task scheduling optimization for data intensive scientific workflow in mul-
tiple data centers environment. In 2014 Second International Conference on
Advanced Cloud and Big Data, pages 77–84. IEEE, 2014.
[108] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian
Makaya, Ting He, and Kevin Chan. Adaptive federated learning in resource
constrained edge computing systems. IEEE Journal on Selected Areas in Com-
munications, 37(6):1205–1221, 2019.
[109] Tao Wang, Shihong Yao, Zhengquan Xu, and Shan Jia. DCCP: an effective
data placement strategy for data-intensive computations in distributed cloud
computing systems. The Journal of Supercomputing, 72(7):2537–2564, 2016.
[110] Ying Wang, Yifan Dong, Songtao Guo, Yuanyuan Yang, and Xiaofeng Liao.
Latency-aware adaptive video summarization for mobile edge clouds. IEEE
Transactions on Multimedia, 22(5):1193–1207, 2019.
[111] Yu Wang and Xiang-Yang Li. Efficient Delaunay-based localized routing for
wireless sensor networks. Wiley International Journal of Communication Sys-
tem, 20(7):767–789, 2007.
[112] Yu Wang, Chih-Wei Yi, Minsu Huang, and Fan Li. Three dimensional greedy
routing in large-scale random wireless sensor networks. Ad Hoc Networks Jour-
nal, 11(4):1331–1344, 2013.
[113] Zhiyuan Wang, Hongli Xu, Jianchun Liu, He Huang, Chunming Qiao, and
Yangming Zhao. Resource-efficient federated learning with hierarchical aggre-
gation in edge computing. In IEEE INFOCOM 2021-IEEE Conference on Com-
puter Communications, pages 1–10. IEEE, 2021.
[115] Qingsong Wei, Bharadwaj Veeravalli, Bozhao Gong, Lingfang Zeng, and Dan
Feng. CDRM: A cost-effective dynamic replication management scheme for
cloud storage cluster. In 2010 IEEE International Conference on Cluster Com-
puting, pages 188–196. IEEE, 2010.
[116] Xinliang Wei, Jiyao Liu, Xinghua Shi, and Yu Wang. Participant selection for
hierarchical federated learning in edge clouds. In IEEE International Conference
on Networking, Architecture, and Storage (NAS 2022), 2022.
[117] Xinliang Wei, Jiyao Liu, and Yu Wang. Joint participant selection and learning
scheduling for multi-model federated edge learning. In IEEE 19th International
Conference on Mobile Ad Hoc and Smart Systems (MASS), Denver, CO, Oct.
2022.
[118] Xinliang Wei, ABM Mohaimenur Rahman, and Yu Wang. Data placement
strategies for data-intensive computing over edge clouds. In 2021 IEEE Inter-
national Performance, Computing, and Communications Conference (IPCCC),
pages 1–8. IEEE, 2021.
[119] Xinliang Wei and Yu Wang. Popularity-based data placement with load bal-
ancing in edge computing. IEEE Transactions on Cloud Computing, 2021.
[120] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image
dataset for benchmarking machine learning algorithms, 2017.
[121] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image
dataset for benchmarking machine learning algorithms. ArXiv, abs/1708.07747,
Aug. 2017.
[122] Junjie Xie, Chen Qian, Deke Guo, Xin Li, Shouqian Shi, and Honghui Chen.
Efficient data placement and retrieval services in edge computing. In 2019 IEEE
39th International Conference on Distributed Computing Systems (ICDCS),
pages 1029–1039. IEEE, 2019.
[123] Junjie Xie, Chen Qian, Deke Guo, Minmei Wang, Shouqian Shi, and Honghui
Chen. Efficient indexing mechanism for unstructured data sharing systems in
edge computing. In IEEE INFOCOM 2019-IEEE Conference on Computer
Communications, pages 820–828. IEEE, 2019.
[124] Jie Xu, Lixing Chen, and Pan Zhou. Joint service caching and task offloading
for mobile edge computing in dense networks. In IEEE INFOCOM 2018-IEEE
Conference on Computer Communications, pages 207–215. IEEE, 2018.
[125] Qiang Xu, Zhengquan Xu, Tao Wang, et al. A data-placement strategy based
on genetic algorithm in cloud computing. International Journal of Intelligence
Science, 5(03):145, 2015.
[126] Zichuan Xu, Lizhen Zhou, Sid Chi-Kin Chau, Weifa Liang, Qiufen Xia, and
Pan Zhou. Collaborate or separate? distributed service caching in mobile edge
clouds. In IEEE INFOCOM 2020-IEEE Conference on Computer Communi-
cations, pages 2066–2075. IEEE, 2020.
[127] Lei Yang, Haipeng Yao, Jingjing Wang, Chunxiao Jiang, Abderrahim Bensli-
mane, and Yunjie Liu. Multi-UAV-enabled load-balance mobile-edge computing
for IoT networks. IEEE Internet of Things Journal, 7(8):6898–6908, 2020.
[128] Song Yang, Nan He, Fan Li, Stojan Trajanovski, Xu Chen, Yu Wang, and
Xiaoming Fu. Survivable task allocation in cloud radio access networks with
mobile edge computing. IEEE Internet of Things Journal, 8(2):1095–1108,
2020.
[129] Song Yang, Fan Li, Meng Shen, Xu Chen, Xiaoming Fu, and Yu Wang. Cloudlet
placement and task allocation in mobile edge computing. IEEE Internet of
Things Journal, 6(3):5853–5863, 2019.
[130] Song Yang, Fan Li, Stojan Trajanovski, Xu Chen, Yu Wang, and Xiaoming Fu.
Delay-aware virtual network function placement and routing in edge clouds.
IEEE Transactions on Mobile Computing, 20(2):445–459, 2021.
[131] Zhaohui Yang, Mingzhe Chen, Walid Saad, Choong Seon Hong, and Mohammad
Shikh-Bahaei. Energy efficient federated learning over wireless communication
networks. IEEE Transactions on Wireless Communications, 20(3):1935–1949,
2020.
[132] Wencong You, Lei Jiao, Sourav Bhattacharya, and Yuan Zhang. Dynamic
distributed edge resource provisioning via online learning across timescales. In
IEEE SECON, 2020.
[133] Dong Yuan, Yun Yang, Xiao Liu, and Jinjun Chen. A data placement strategy
in scientific cloud workflows. Future Generation Computer Systems, 26(8):1200–
1214, 2010.
[134] Chen Zhang, Hongwei Du, Qiang Ye, Chuang Liu, and He Yuan. DMRA: a
decentralized resource allocation scheme for Multi-SP mobile edge computing.
In 2019 IEEE 39th International Conference on Distributed Computing Systems
(ICDCS), pages 390–398. IEEE, 2019.
[135] Jiale Zhang, Bing Chen, Yanchao Zhao, Xiang Cheng, and Feng Hu. Data
security and privacy-preserving in edge computing paradigm: Survey and open
issues. IEEE Access, 6:18209–18237, 2018.
[136] Jie Zhang, Hongzhi Guo, Jiajia Liu, and Yanning Zhang. Task offloading in
vehicular edge computing networks: A load-balancing solution. IEEE Transac-
tions on Vehicular Technology, 69(2):2092–2104, 2019.
[137] Wei Zhang, Xiao Chen, and Jianhui Jiang. A multi-objective optimization
method of initial virtual machine fault-tolerant placement for star topological
data centers of cloud systems. Tsinghua Science and Technology, 26(1):95–111,
2021.
[138] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional
networks for text classification. In Advances in Neural Information Processing
Systems (NIPS), 2015.
[139] L. Zhao and J. Liu. Optimal placement of virtual machines for supporting
multiple applications in mobile edge networks. IEEE Transactions on Vehicular
Technology, 67(7):6533–6545, July 2018.
[140] L. Zhao, W. Sun, Y. Shi, and J. Liu. Optimal placement of cloudlets for access
delay minimization in SDN-based Internet of Things networks. IEEE Internet of
Things Journal, 5(2):1334–1344, April 2018.
[141] Qing Zhao, Congcong Xiong, Xi Zhao, Ce Yu, and Jian Xiao. A data place-
ment strategy for data-intensive scientific workflows in cloud. In 2015 15th
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing,
pages 928–934. IEEE, 2015.
[142] Zhongqi Zhao, Lei Fan, and Zhu Han. Hybrid quantum Benders’ decomposition
for mixed-integer linear programming. In IEEE Wireless Communications and
Networking Conference (WCNC), Austin, TX, Apr. 2022.
[143] Han-Sen Zhong, Hui Wang, Yu-Hao Deng, Ming-Cheng Chen, Li-Chao Peng,
Yi-Han Luo, Jian Qin, Dian Wu, Xing Ding, Yi Hu, et al. Quantum computa-
tional advantage using photons. Science, 370(6523):1460–1463, Dec. 2020.
[144] Hongbin Zhu, Yong Zhou, Hua Qian, Yuanming Shi, Xu Chen, and Yang Yang.
Online client selection for asynchronous federated learning with fairness con-
sideration. IEEE Transactions on Wireless Communications, Oct. 2022. doi:
10.1109/TWC.2022.3211998.