
Efficient Federated Meta-Learning over Multi-Access Wireless Networks

Sheng Yue, Student Member, IEEE, Ju Ren, Member, IEEE, Jiang Xin, Student Member, IEEE, Deyu Zhang, Member, IEEE, Yaoxue Zhang, Senior Member, IEEE, and Weihua Zhuang, Fellow, IEEE

arXiv:2108.06453v4 [cs.LG] 11 Nov 2021

Sheng Yue, Jiang Xin, and Deyu Zhang are with the School of Computer Science and Engineering, Central South University, Changsha, 410083 China. Emails: {sheng.yue, xinjiang, zdy876}@csu.edu.cn.
Ju Ren and Yaoxue Zhang are with the Department of Computer Science and Technology, Tsinghua University, Beijing, 100084 China. Emails: {renju, zhangyx}@tsinghua.edu.cn.
Weihua Zhuang is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. Email: wzhuang@uwaterloo.ca.

Abstract—Federated meta-learning (FML) has emerged as a promising paradigm to cope with the data limitation and heterogeneity challenges in today's edge learning arena. However, its performance is often limited by slow convergence and correspondingly low communication efficiency. In addition, since the available radio spectrum and IoT devices' energy capacity are usually insufficient, it is crucial to control the resource allocation and energy consumption when deploying FML in practical wireless networks. To overcome these challenges, in this paper we rigorously analyze the contribution of each device to the global loss reduction in each round and develop an FML algorithm (called NUFM) with a non-uniform device selection scheme to accelerate the convergence. After that, we formulate a resource allocation problem integrating NUFM in multi-access wireless systems to jointly improve the convergence rate and minimize the wall-clock time along with energy cost. By deconstructing the original problem step by step, we devise a joint device selection and resource allocation strategy to solve the problem with theoretical guarantees. Further, we show that the computational complexity of NUFM can be reduced from O(d²) to O(d) (with model dimension d) via combining two first-order approximation techniques. Extensive simulation results demonstrate the effectiveness and superiority of the proposed methods in comparison with existing baselines.

Index Terms—Federated meta-learning, multi-access systems, device selection, resource allocation, efficiency.

I. INTRODUCTION

THE integration of Artificial Intelligence (AI) and Internet-of-Things (IoT) has led to a proliferation of studies on edge intelligence, aiming at pushing AI frontiers to the wireless network edge proximal to IoT devices and data sources [1]. It is expected that edge intelligence will reduce time-to-action latency down to milliseconds for IoT applications while minimizing network bandwidth and offering security guarantees [2]. However, a general consensus is that a single IoT device can hardly realize edge intelligence due to its limited computational and storage capabilities. Accordingly, it is natural to rely on collaboration in edge learning, whereby IoT devices work together to accomplish computation-intensive tasks [3].

Building on a synergy of federated learning [4] and meta-learning [5], federated meta-learning (FML) has been proposed under a common theme of fostering edge-edge collaboration [6]–[8]. In FML, IoT devices join forces to learn an initial shared model under the orchestration of a central server such that current or new devices can quickly adapt the learned model to their local datasets via one or a few gradient descent steps. Notably, FML can keep all the benefits of the federated learning paradigm (such as simplicity, data security, and flexibility), while giving a more personalized model for each device to capture the differences among tasks [9]. Therefore, FML has emerged as a promising approach to tackle the heterogeneity challenges in federated learning and to facilitate efficient edge learning [10].

Despite its promising benefits, FML comes with new challenges. On one hand, the number of participating devices can be enormous. The uniform device selection at random, often done in the existing methods, leads to a low convergence speed [9], [11]. Although recent studies [12]–[14] have characterized the convergence of federated learning and proposed non-uniform device selection mechanisms, they cannot be directly applied to FML problems due to the bias and high-order information in the stochastic gradients. On the other hand, the performance of FML in a wireless environment is highly related to its wall-clock time, including the computation time (determined by local data sizes and devices' CPU types) and communication time (depending on channel gains, interference, and transmission power) [15], [16]. If not properly controlled, a large wall-clock time can cause unexpected training delay and communication inefficiency. In addition, DNN model training, which involves a large number of data samples and epochs, usually induces high computational cost, especially for sophisticated model structures consisting of millions of parameters [15]. Due to the limited power capacity of IoT devices, energy consumption should also be properly managed to ensure system sustainability and stability [14]. In a nutshell, for the purpose of efficiently deploying FML in today's wireless systems, the strategy of device selection and resource allocation must be carefully crafted not only to accelerate the learning process, but also to control the wall-clock time of training and the energy cost in edge devices. Unfortunately, despite their importance, there are limited studies on these aspects in the current literature.
In this paper, we tackle the above-mentioned challenges in two steps: 1) we develop an algorithm (called NUFM) with a non-uniform device selection scheme to improve the convergence rate of the vanilla FML algorithm; 2) based on NUFM, we propose a resource allocation strategy (called URAL) that jointly optimizes the convergence speed, wall-clock time, and energy consumption in the context of multi-access wireless systems. More specifically, we first rigorously quantify the contribution of each device to the convergence of FML via deriving a tight lower bound on the reduction of the one-round global loss. Based on the quantitative results, we present the non-uniform device selection scheme that maximizes the loss reduction per round, followed by the NUFM algorithm. Then, we formulate a resource allocation problem for NUFM over wireless networks, capturing the trade-offs among convergence, wall-clock time, and energy cost. To solve this problem, we exploit its special structure and decompose it into two sub-problems. The first one is to minimize the computation time via controlling devices' CPU-cycle frequencies, which is solved optimally based on the analysis of the effect of device heterogeneity on the objective. The second sub-problem aims at optimizing the resource block allocation and transmission power management. It is a non-convex mixed-integer non-linear programming (MINLP) problem, and deriving a closed-form solution is a non-trivial task. Thus, after deconstructing the problem step by step, we devise an iterative method to solve it and provide a convergence guarantee.

In summary, our main contributions are four-fold.

• We provide a theoretical characterization of the contribution of an individual device to the convergence of FML in each round, via establishing a tight lower bound on the one-round reduction of the expected global loss. Using this quantitative result, we develop NUFM, a fast-convergent FML algorithm with non-uniform device selection;
• To embed NUFM in the context of multi-access wireless systems, we formulate a resource allocation problem, capturing the trade-offs among the convergence, wall-clock time, and energy consumption. By decomposing the original problem into two sub-problems and deconstructing the sub-problems step by step, we propose a joint device selection and resource allocation algorithm (namely URAL) to solve the problem effectively with theoretical performance guarantees;
• To reduce the computational complexity, we further integrate our proposed algorithms with two first-order approximation techniques in [9], by which the complexity of a one-step update in NUFM can be reduced from O(d²) to O(d). We also show that our theoretical results hold in these cases;
• We provide extensive simulation results on challenging real-world benchmarks (i.e., Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet) to demonstrate the efficacy of our methods.

The remainder of this paper is organized as follows. Section II briefly reviews the related work. Section III introduces the FML problem and the standard algorithm. We present the non-uniform device selection scheme in Section IV and adapt the scheme for wireless networks in Section V. Finally, Section VI presents the extension to first-order approximation techniques, followed by the simulation results in Section VII and a conclusion drawn in Section VIII.

II. RELATED WORK

Federated learning (FL) [4] has been proposed as a promising technique to facilitate edge-edge collaborative learning [17]. However, due to the heterogeneity in devices, models, and data distributions, a shared global model often fails to capture the individual information of each device, leading to performance degradation in inference or classification [10], [18], [19].

Very recently, based on the advances in meta-learning [5], federated meta-learning (FML) has garnered much attention; it aims to learn a personalized model for each device to cope with the heterogeneity challenges [6]–[9], [11]. Chen et al. [6] first introduce an FML method called FedMeta, integrating the model-agnostic meta-learning (MAML) algorithm [5] into the federated learning framework. They show that FML can significantly improve the performance of FedAvg [4]. Jiang et al. [7] analyze the connection between FedAvg and MAML, and empirically demonstrate that FML enables better and more stable personalized performance. From a theoretical perspective, Lin et al. [8] analyze the convergence properties and computational complexity of FML with strongly convex loss functions and exact gradients. Fallah et al. [9] further provide convergence guarantees in non-convex cases with stochastic gradients. Different from the above gradient descent–based approaches, another recent work [11] develops an ADMM-based FML method and gives its convergence guarantee in non-convex cases. However, due to selecting devices uniformly at random, the existing FML algorithms often suffer from slow convergence and low communication efficiency [9]. Further, deploying FML in practical wireless systems calls for effective resource allocation strategies [15], [20], which is beyond the scope of the existing FML literature.

There exists a significant body of work on convergence improvement and resource allocation for FL [13]–[15], [21]–[36]. Regarding convergence improvement, Nguyen et al. [13] propose a fast-convergent FL algorithm, called FOLB, which achieves a near-optimal lower bound for the overall loss decrease in each round. Note that, while the idea of NUFM is similar to [13], the lower bound for FML is derived from a completely different technical path due to the inherent complexity in the local update. To minimize the convergence time, a probabilistic device selection scheme for FL is designed in [25], which assigns high probabilities to the devices with large effects on the global model. Ren et al. [26] investigate a batch-size selection strategy for accelerating the FL training process. Karimireddy et al. [30] employ the variance reduction technique to develop a new FL algorithm. Based on momentum methods, Yu et al. [27] give an FL algorithm with a linear speedup property. Regarding resource allocation in FL, Dinh et al. [15] embed FL in wireless networks, considering the trade-offs between training time and energy consumption, under the assumption that all devices participate in the whole training process. From a long-term perspective, Xu et al. [28] empirically investigate the device selection scheme jointly with bandwidth allocation for FL, using the Lyapunov optimization method. Chen et al. [14] investigate a device selection problem with "hard" resource constraints to enable the implementation of FL over wireless networks. Wang et al. [29] propose a control algorithm to determine the best trade-off between local update and global aggregation under a resource budget.
Although extensive research has been carried out on FL, researchers have not treated FML in much detail. In particular, the existing FL acceleration techniques cannot be directly applied to FML due to the high-order information and biased stochastic gradients in the local update phase¹ (see Lemma 2 in Section IV). At the same time, device selection and resource allocation need to be crafted jointly [6], rather than by simply plugging the existing strategies in.

III. PRELIMINARIES AND ASSUMPTIONS

In this section, we introduce federated meta-learning, including the learning problem, the standard algorithm, and the assumptions for theoretical analysis.

A. Federated Meta-Learning Problem

We consider a set N of user devices that are all connected to a server. Each device i ∈ N has a labeled dataset D_i = {x_i^j, y_i^j}_{j=1}^{D_i} that can be accessed only by itself. Here, the tuple (x_i^j, y_i^j) ∈ X × Y is a data sample with input x_i^j and label y_i^j, and follows an unknown underlying distribution P_i. Define θ as the model parameter, such as the weights of a Deep Neural Network (DNN) model. For device i, the loss function of a model parameter θ ∈ R^d is defined as ℓ_i(θ; x, y), which measures the error of model θ in predicting the true label y given input x.

Federated meta-learning (FML) looks for a good model initialization (also called a meta-model) such that well-performing models for different devices can be quickly obtained via one or a few gradient descent steps. More specifically, FML aims to solve the following problem

min_{θ∈R^d} F(θ) := (1/n) Σ_{i∈N} f_i(θ − α∇f_i(θ))   (1)

where f_i represents the expected loss function over the data distribution of device i, i.e., f_i(θ) := E_{(x,y)∼P_i}[ℓ_i(θ; x, y)], n = |N| is the number of devices, and α is the stepsize.

The advantages of this formulation are two-fold: 1) it gives a personalized solution that can capture the heterogeneity between the devices; 2) the meta-model can quickly adapt to new devices via slightly updating it with respect to their own data. Clearly, FML fits edge learning well, where edge devices have insufficient computing power and limited data samples.

Next, we review the standard FML algorithm in the literature.

B. Standard Algorithm

Similar to federated learning, the vanilla FML algorithm solves (1) in two repeating steps: local update and global aggregation [9], as detailed below.

• Local update: At the beginning of each round k, the server first sends the current global model θ^k to a fraction of devices N_k chosen uniformly at random with preset size n_k. Then, each device i ∈ N_k updates the received model based on its meta-function F_i(θ) := f_i(θ − α∇f_i(θ)) by running τ (≥ 1) steps of stochastic gradient descent locally (also called mini-batch gradient descent), i.e.,

θ_i^{k,t+1} = θ_i^{k,t} − β ∇̃F_i(θ_i^{k,t}), for 0 ≤ t ≤ τ − 1   (2)

where θ_i^{k,t} denotes the local model of device i in the t-th step of the local update in round k with θ_i^{k,0} = θ^k, and β > 0 is the meta-learning rate. In (2), the stochastic gradient ∇̃F_i(θ) is given by

∇̃F_i(θ) := (I − α ∇̃²f_i(θ, D_i'')) ∇̃f_i(θ − α ∇̃f_i(θ, D_i), D_i')   (3)

where D_i, D_i', and D_i'' are independent batches², and for any batch D, ∇̃f_i(θ, D) and ∇̃²f_i(θ, D) are the unbiased estimates of ∇f_i(θ) and ∇²f_i(θ), respectively, i.e.,

∇̃f_i(θ, D) := (1/|D|) Σ_{(x,y)∈D} ∇ℓ_i(θ; x, y)   (4)
∇̃²f_i(θ, D) := (1/|D|) Σ_{(x,y)∈D} ∇²ℓ_i(θ; x, y).   (5)

• Global aggregation: After updating the local model parameter, each selected device sends its local model θ_i^k = θ_i^{k,τ} to the server. The server updates the global model by averaging over the received models, i.e.,

θ^{k+1} = (1/n_k) Σ_{i∈N_k} θ_i^k.   (6)

It is easy to see that the main difference between federated learning and FML lies in the local update phase: in federated learning, the local update is done using unbiased gradient estimates, while FML uses a biased one consisting of high-order information. Besides, federated learning can be considered a special case of FML, i.e., FML with α = 0.

C. Assumptions

In this subsection, we list the standard assumptions for the analysis of FML algorithms [8], [9], [11].

Assumption 1 (Smoothness). The expected loss function f_i corresponding to device i ∈ N is twice continuously differentiable and L_i-smooth, i.e.,

‖∇f_i(θ_1) − ∇f_i(θ_2)‖ ≤ L_i ‖θ_1 − θ_2‖, ∀θ_1, θ_2 ∈ R^d.   (7)

Besides, its gradient is bounded by a positive constant ζ_i, i.e., ‖∇f_i(θ)‖ ≤ ζ_i.

Assumption 2 (Lipschitz Hessian). The Hessian of function f_i is ρ_i-Lipschitz continuous for each i ∈ N, i.e.,

‖∇²f_i(θ_1) − ∇²f_i(θ_2)‖ ≤ ρ_i ‖θ_1 − θ_2‖, ∀θ_1, θ_2 ∈ R^d.   (8)

¹ Different from FML, FL is a first-order method with unbiased stochastic gradients.
² We slightly abuse the notation D_i as a batch of the local dataset of the i-th device.
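To make the two-phase procedure concrete, the following Python sketch implements one device's local update (2) using the stochastic meta-gradient (3)–(5). It is a minimal illustration under an assumed quadratic per-sample loss; the helper names (grad_loss, hess_loss, meta_gradient, local_update) are ours, not part of the paper.

```python
import numpy as np

# Assumed toy per-sample loss l_i(theta; x, y) = 0.5 * (x.theta - y)^2,
# so its gradient and Hessian are available in closed form.
def grad_loss(theta, x, y):
    return (x @ theta - y) * x

def hess_loss(theta, x, y):
    return np.outer(x, x)

def batch_grad(theta, batch):            # Eq. (4): unbiased gradient estimate
    return np.mean([grad_loss(theta, x, y) for x, y in batch], axis=0)

def batch_hess(theta, batch):            # Eq. (5): unbiased Hessian estimate
    return np.mean([hess_loss(theta, x, y) for x, y in batch], axis=0)

def meta_gradient(theta, D, D1, D2, alpha):
    """Stochastic meta-gradient of F_i, Eq. (3), with independent batches D, D', D''."""
    inner = theta - alpha * batch_grad(theta, D)     # inner adaptation step
    g_outer = batch_grad(inner, D1)                  # gradient at the adapted point
    H = batch_hess(theta, D2)                        # Hessian estimate at theta
    return (np.eye(theta.shape[0]) - alpha * H) @ g_outer

def local_update(theta, batches, alpha, beta, tau):
    """tau steps of Eq. (2); `batches` yields independent (D, D', D'') triples."""
    for _ in range(tau):
        D, D1, D2 = next(batches)
        theta = theta - beta * meta_gradient(theta, D, D1, D2, alpha)
    return theta

# Tiny demo with synthetic data (assumed dimensions).
rng = np.random.default_rng(0)
make_batch = lambda k: [(rng.normal(size=5), rng.normal()) for _ in range(k)]
g = meta_gradient(np.zeros(5), make_batch(8), make_batch(8), make_batch(8), alpha=0.01)
print(g.shape)
```

In practice the three mini-batches D_i, D_i', D_i'' would be drawn independently from the device's local dataset, mirroring the independence used in Lemma 2 below.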
TABLE I: KEY NOTATIONS
θ: Model parameter. F_i(·): Meta-loss function.
α, β: Stepsize and meta-learning rate. L_i, ρ_i: Lipschitz continuity parameters.
i, j: Indexes of user devices and data samples. σ_G, σ_H: Upper bounds of variances.
k: Index of training rounds. γ_G, γ_H: Similarity parameters.
t: Index of local update steps. u_i: Contribution of device i to global loss reduction.
τ: Number of local update steps. ν_i: CPU-cycle frequency of device i.
P_i: Underlying data distribution of device i. p_i: Transmission power of device i.
D_i, D_i: Dataset of device i and its size. z_{i,m}: Binary variable indicating whether device i accesses RB m.
N: Set of devices. η_1, η_2: Weight parameters.
n: Number of devices. c_i: CPU cycles for computing one sample at device i.
N_k: Set of participating devices in round k. h_i, I_m: Channel gain of device i and interference in RB m.
n_k: Number of participating devices in round k. B, N_0: Bandwidth of each RB and noise power spectral density.
x, y: Input and corresponding label. M, M: Set and number of RBs.
ℓ_i(θ; x, y): Loss of model θ on sample (x, y). E: Total energy consumption.
f_i(·): Expected loss function. T: Total latency.

Assumption 3 (Bounded Variance). Given any θ ∈ R^d, the following facts hold for the stochastic gradients ∇ℓ_i(θ; x, y) and Hessians ∇²ℓ_i(θ; x, y) with (x, y) ∈ X × Y:

E_{(x,y)∼P_i}[‖∇ℓ_i(θ; x, y) − ∇f_i(θ)‖²] ≤ σ_G²   (9)
E_{(x,y)∼P_i}[‖∇²ℓ_i(θ; x, y) − ∇²f_i(θ)‖²] ≤ σ_H².   (10)

Assumption 4 (Similarity). For any θ ∈ R^d and i, j ∈ N, there exist nonnegative constants γ_G ≥ 0 and γ_H ≥ 0 such that the gradients and Hessians of the expected loss functions f_i(θ) and f_j(θ) satisfy the following conditions:

‖∇f_i(θ) − ∇f_j(θ)‖ ≤ γ_G   (11)
‖∇²f_i(θ) − ∇²f_j(θ)‖ ≤ γ_H.   (12)

Assumption 2 implies the high-order smoothness of f_i(θ) for dealing with the second-order information in the local update step (2). Assumption 4 indicates that the variations of gradients between different devices are bounded by some constants, which captures the similarities between devices' tasks corresponding to non-IID data. It holds for many practical loss functions [37], such as logistic regression and hyperbolic tangent functions. In particular, γ_G and γ_H can be roughly seen as a distance between data distributions P_i and P_j [38].

IV. NON-UNIFORM FEDERATED META-LEARNING

Due to the uniform selection of devices in each round, the convergence rate of the standard FML algorithm is naturally slow. In this section, we present a non-uniform device selection scheme to tackle this challenge.

A. Device Contribution Quantification

We begin by quantifying the contribution of each device to the reduction of the one-round global loss using its dataset size and gradient norms. For convenience, define ζ := max_i ζ_i, L := max_i L_i, and ρ := max_i ρ_i. We first provide necessary lemmas before giving the main result.

Lemma 1. If Assumptions 1 and 2 hold, the local meta-function F_i is smooth with parameter L_F := (1 + αL)² L + αρζ.

Sketch of Proof. We expand the expression of ‖∇F_i(θ_1) − ∇F_i(θ_2)‖ into two parts by the triangle inequality, followed by bounding each part via Assumptions 1 and 2. The detailed proof is presented in Appendix A of the technical report [39].

Lemma 1 gives the smoothness of the local meta-function F_i and the global loss function F.

Lemma 2. Suppose that Assumptions 1–3 are satisfied, and D_i, D_i', and D_i'' are independent batches with sizes D_i, D_i', and D_i'' respectively. For any θ ∈ R^d, the following holds:

‖E[∇̃F_i(θ)] − ∇F_i(θ)‖ ≤ α σ_G L (1 + αL) / √D_i   (13)
E[‖∇̃F_i(θ) − ∇F_i(θ)‖²] ≤ σ_{F_i}²   (14)

where σ_{F_i}² is denoted as

σ_{F_i}² := 6 σ_G² (1 + αL)² (1/D_i' + (αL)²/D_i) + 3 (α ζ σ_H)²/D_i'' + 6 (α σ_G σ_H)²/D_i'' (1/D_i' + (αL)²/D_i).   (15)

Sketch of Proof. We first obtain

‖E[∇̃F_i(θ)] − ∇F_i(θ)‖ ≤ ‖I − α∇²f_i(θ)‖ ‖E[δ_2*]‖ + E[‖δ_1* δ_2*‖] + ‖E[δ_1*]‖ ‖∇f_i(θ − α∇f_i(θ))‖,
E[‖∇̃F_i(θ) − ∇F_i(θ)‖²] ≤ 3 ( E[‖δ_1*‖²] E[‖δ_2*‖²] + ζ² E[‖δ_1*‖²] + (1 + αL)² E[‖δ_2*‖²] ),

where δ_1* and δ_2* are given by

δ_1* = α (∇²f_i(θ) − ∇̃²f_i(θ, D_i''))
δ_2* = ∇̃f_i(θ − α ∇̃f_i(θ, D_i), D_i') − ∇f_i(θ − α∇f_i(θ)).

Then, we derive the results via bounding the first and second moments of δ_1* and δ_2*. The detailed proof is presented in Appendix B of the technical report [39].
Lemma 2 shows that the stochastic gradient ∇̃F_i(θ) is a biased estimate of ∇F_i(θ), revealing the challenges in analyzing FML algorithms.

Lemma 3. If Assumptions 1, 2, and 4 are satisfied, then for any θ ∈ R^d and i, j ∈ N, we have

‖∇F_i(θ) − ∇F_j(θ)‖ ≤ (1 + αL)² γ_G + α ζ γ_H.   (16)

Sketch of Proof. We divide the bound of ‖∇F_i(θ) − ∇F_j(θ)‖ into two independent terms, followed by bounding the two terms separately. The detailed proof is presented in Appendix C of the technical report [39].

Lemma 3 characterizes the similarities between the local meta-functions, which is critical for analyzing the one-step global loss reduction because it relates the local meta-functions to the global objective. Based on Lemmas 1–3, we are now ready to give our main result.

Theorem 1. Suppose that Assumptions 1–4 are satisfied, and D_i, D_i', and D_i'' are independent batches. If the local update and global aggregation follow (2) and (6) respectively, then the following fact holds true for τ = 1:

E[F(θ^k) − F(θ^{k+1})] ≥ β E[ (1/n_k) Σ_{i∈N_k} ( (1 − L_F β/2) ‖∇̃F_i(θ^k)‖² − ((1 + αL)² γ_G + α ζ γ_H + σ_{F_i}) √(E[‖∇̃F_i(θ^k)‖² | N_k]) ) ]   (17)

where the outer expectation on the RHS is taken with respect to the selected user set N_k and the data sample sizes, and the inner expectation is only with respect to the data sample sizes.

Sketch of Proof. Using the smoothness condition of F_i, we express the lower bound of the loss reduction by

E[F(θ^k) − F(θ^{k+1})] ≥ E[G_k],

where G_k is defined as

G_k := β ∇F(θ^k)^⊤ (1/n_k) Σ_{i∈N_k} ∇̃F_i(θ^k) − (L_F β²/2) ‖ (1/n_k) Σ_{i∈N_k} ∇̃F_i(θ^k) ‖².

The key step to derive the desired result is providing a tight lower bound for the product of ∇F(θ^k) and ∇̃F(θ^k). The detailed proof is presented in Appendix G of the technical report [39].

Theorem 1 provides a lower bound on the one-round reduction of the global objective function F based on the device selection. It implies that different user selections have varying impacts on the objective improvement, and it quantifies the contribution of each device to the objective improvement, depending on the variance of the local meta-function, task similarities, smoothness, and learning rates. It therefore provides a criterion for selecting users to accelerate the convergence. From Theorem 1, we have the following corollary, which simplifies the above result and extends it to multi-step cases.

Corollary 1. Suppose that Assumptions 1–4 are satisfied, and D_i, D_i', and D_i'' are independent batches with D_i = D_i' = D_i''. If the local update and global aggregation follow (2) and (6) respectively, then the following fact holds true for β ∈ [0, 1/L_F):

E[F(θ^k) − F(θ^{k+1})] ≥ (β/2) E[ (1/n_k) Σ_{i∈N_k} Σ_{t=0}^{τ−1} ( ‖∇̃F_i(θ_i^{k,t})‖² − 2 (λ_1 + λ_2/√D_i) √(E[‖∇̃F_i(θ_i^{k,t})‖² | N_k]) ) ]   (18)

where the positive constants λ_1 and λ_2 satisfy

λ_1 ≥ (1 + αL)² γ_G + α ζ γ_H + β τ √(35 (γ_G² + 2 σ_F²))   (19)
λ_2 ≥ 6 σ_G² (1 + (αL)²) ((α σ_H)² + (1 + αL)²) + 3 (α ζ σ_H)².   (20)

Sketch of Proof. The proof is similar to that of Theorem 1, with additional tricks in bounding the product of ∇F(θ^k) and ∇̃F(θ^k). The detailed proof is presented in Appendix I of the technical report [39].

Corollary 1 implies that a device with a large gradient naturally accelerates the global loss decrease, but a small dataset size degrades the process due to the correspondingly high variance. Besides, as the device dissimilarities become large, the lower bound (18) weakens.

Motivated by Corollary 1, we study the device selection in the following subsection.

B. Device Selection

To improve the convergence speed, we aim to maximize the lower bound (18) on the one-round objective reduction. Based on Corollary 1, we define the contribution u_i^k of device i to the convergence in round k as

u_i^k := Σ_{t=0}^{τ−1} ( ‖∇̃F_i(θ_i^{k,t})‖² − 2 (λ_1 + λ_2/√D_i) ‖∇̃F_i(θ_i^{k,t})‖ )   (21)

where we replace the second moment of ∇̃F_i(θ_i^{k,t}) by its sample value ‖∇̃F_i(θ_i^{k,t})‖². Then, the device selection problem in round k can be formulated as

max_{z_i} Σ_{i∈N} z_i u_i^k
s.t. Σ_{i∈N} z_i = n_k   (22)
     z_i ∈ {0, 1}, ∀i ∈ N.

In (22), z_i is a binary variable: z_i = 1 means device i is selected in this round, and z_i = 0 otherwise.
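As an illustration of the selection rule, the sketch below accumulates u_i^k from the per-step meta-gradient norms as in (21) and then solves (22) by keeping the n_k largest scores (a plain top-n_k selection, since (22) only has a cardinality constraint). The function names and the NumPy implementation are assumptions for illustration, not the authors' code.

```python
import numpy as np

def contribution(meta_grad_norms, D_i, lam1, lam2):
    """u_i^k from Eq. (21): sum over local steps t of
       ||grad F_i||^2 - 2*(lam1 + lam2/sqrt(D_i)) * ||grad F_i||."""
    g = np.asarray(meta_grad_norms, dtype=float)
    return float(np.sum(g**2 - 2.0 * (lam1 + lam2 / np.sqrt(D_i)) * g))

def select_devices(u, n_k):
    """Solve (22): keep the n_k devices with the largest contributions.
       np.argpartition returns the top-n_k indices in (average) linear time."""
    u = np.asarray(u, dtype=float)
    n_k = min(n_k, len(u))
    return np.argpartition(-u, n_k - 1)[:n_k]

# Example: 100 devices with random scores, pick n_k = 20.
rng = np.random.default_rng(0)
print(select_devices(rng.normal(size=100), 20))
```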
The solution of (22) can be found by the Select algorithm introduced in [40, Chapter 9] with worst-case complexity O(n) (problem (22) is indeed a selection problem). Accordingly, our device selection scheme is as follows.

Device selection: Following the local update phase, instead of selecting devices uniformly at random, each device i first computes its contribution scalar u_i^k locally and sends it to the server. After receiving {u_i^k}_{i∈N} from all devices, the server runs the Select algorithm and finds the optimal device set, denoted by N_k*. Notably, although the constants λ_1 and λ_2 in (22) consist of unknown parameters such as L, γ_G, and γ_H, they can be either estimated during training as in [29] or directly tuned as in our simulations.

Based on the device selection scheme, we propose the Non-Uniform Federated Meta-Learning (NUFM) algorithm (depicted in Algorithm 1). In particular, although NUFM requires an additional communication phase to upload u_i^k to the server, the communication overhead is negligible because u_i^k is just a scalar.

Algorithm 1: Non-Uniform Federated Meta-Learning (NUFM)
Input: α, β, λ_1, λ_2
1  Server initializes model θ^0 and sends it to all devices;
2  for round k = 0 to K − 1 do
3      foreach device i ∈ N do
       // Local update
4          Initialize θ_i^{k,0} ← θ^k and u_i^k ← 0;
5          for local step t = 0 to τ − 1 do
6              Compute stochastic gradient ∇̃F_i(θ_i^{k,t}) by (3) using batches D_i, D_i', and D_i'';
7              Update local model θ_i^{k,t+1} by (2);
8              Update contribution scalar u_i^k by
               u_i^k ← u_i^k + ‖∇̃F_i(θ_i^{k,t})‖² − 2 (λ_1 + λ_2/√D_i) ‖∇̃F_i(θ_i^{k,t})‖;
9          end
10         Set θ_i^k = θ_i^{k,τ} and send u_i^k to the server;
11     end
       // Device selection
12     Once receiving {u_i^k}_{i∈N}, the server computes the optimal device selection N_k* by solving (22);
       // Global aggregation
13     After receiving the local models {θ_i^k}_{i∈N_k*}, the server computes the global model by (6);
14 end
15 return θ^K

V. FEDERATED META-LEARNING OVER WIRELESS NETWORKS

In this section, we extend NUFM to the context of multi-access wireless systems, where the bandwidth for uplink transmission and the power of IoT devices are limited. First, we present the system model followed by the problem formulation. Then, we decompose the original problem into two sub-problems and devise solutions for each of them with theoretical performance guarantees.

A. System Model

As illustrated in Fig. 1, we consider a wireless multi-user system, where a set N of n end devices joins forces to carry out federated meta-learning aided by an edge server. Each round consists of two stages: the computation phase and the communication phase. In the computation phase, each device i ∈ N downloads the current global model and computes its local model based on its local dataset; in the communication phase, the selected devices transmit their local models to the edge server via a limited number of wireless channels. After that, the edge server runs the global aggregation and starts the next round. Here we do not consider the downlink communication due to the asymmetric uplink-downlink settings in wireless networks. That is, the transmission power at the server (e.g., a base station) and the downlink communication bandwidth are generally sufficient for global meta-model transmission. Thus, the downlink time is usually neglected compared to the uplink data transmission time [15]. Since we focus on the device selection and resource allocation problem in each round, we omit the subscript k for brevity throughout this section.

Fig. 1. The architecture of federated meta-learning over a wireless network with multiple user devices and an edge server. Due to limited communication resources, only part of the user devices can upload their local models in each training round.

1) Computation Model: We denote by c_i the number of CPU cycles for device i to update the model with one sample, which can be measured offline as a priori knowledge [15]. Assume that the batch size of device i used in the local update phase (2) is D_i. Then, the number of CPU cycles required for device i to run a one-step local update is c_i D_i. We denote the CPU-cycle frequency of device i as ν_i. Thus, the CPU energy consumption of device i during the local update phase can be expressed by

E_i^cp(ν_i) := (ι_i/2) τ_i c_i D_i ν_i²   (23)

where ι_i/2 is the effective capacitance coefficient of the computing chipset of device i [41]. The computational time of device i in a round can be denoted as

T_i^cp(ν_i) := τ_i c_i D_i / ν_i.   (24)

For simplicity, we set τ = 1 in the following.
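For reference, the computation-phase quantities (23) and (24) can be evaluated directly; the sketch below does so for a few devices with made-up (assumed) parameters.

```python
import numpy as np

def comp_energy(nu, c, D, iota, tau=1):
    """E_i^cp from Eq. (23): (iota/2) * tau * c * D * nu^2."""
    return 0.5 * iota * tau * c * D * nu**2

def comp_time(nu, c, D, tau=1):
    """T_i^cp from Eq. (24): tau * c * D / nu."""
    return tau * c * D / nu

# Illustrative (assumed) parameters for five devices.
c    = np.array([0.10, 0.20, 0.15, 0.05, 0.25])   # CPU cycles per sample
D    = np.array([8, 4, 16, 10, 6])                # local batch sizes
iota = np.array([0.5, 0.3, 0.8, 0.2, 0.6])        # capacitance coefficients (times 2)
nu   = np.array([1.0, 0.8, 1.5, 0.6, 1.2])        # CPU-cycle frequencies

print(comp_energy(nu, c, D, iota))   # per-device CPU energy in one round
print(comp_time(nu, c, D))           # per-device computation time in one round
```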
2) Communication Model: We consider a multi-access protocol for the devices, i.e., the orthogonal frequency-division multiple access (OFDMA) technique, whereby each device can occupy one uplink resource block (RB) in a communication round to upload its local model. There are M RBs in the system, denoted by M = {1, 2, ..., M}. The achievable transmission rate of device i is [25]

r_i(z_i, p_i) := Σ_{m∈M} z_{i,m} B log₂(1 + h_i p_i / (I_m + B N_0))   (25)

with B the bandwidth of each RB, h_i the channel gain, N_0 the noise power spectral density, p_i the transmission power of device i, and I_m the interference caused by devices that are located in other service areas and use the same RB. In (25), z_{i,m} ∈ {0, 1} is a binary variable associated with the m-th RB allocation for device i: z_{i,m} = 1 indicates that RB m is allocated to device i, and z_{i,m} = 0 otherwise. Each device can occupy at most one RB while each RB can be accessed by at most one device; thereby we have

Σ_{m∈M} z_{i,m} ≤ 1, ∀i ∈ N   (26)
Σ_{i∈N} z_{i,m} ≤ 1, ∀m ∈ M.   (27)

Due to the fixed dimension of the model parameters, we assume that the model sizes of all devices are constant throughout the learning process, denoted by S. If device i is selected, the time duration of transmitting its model is given by

T_i^co(z_i, p_i) := S / r_i(z_i, p_i)   (28)

where z_i := {z_{i,m} | m ∈ M}. Besides, the energy consumption of the transmission is

E_i^co(z_i, p_i) := Σ_{m∈M} z_{i,m} T_i^co(z_i, p_i) p_i.   (29)

If no RB is allocated to device i in the current round, its transmission power and energy consumption are zero.

B. Problem Formulation

For ease of exposition, we define ν := {ν_i | i ∈ N}, p := {p_i | i ∈ N}, and z := {z_{i,m} | i ∈ N, m ∈ M}. Recall the procedure of NUFM. The total energy consumption E(z, p, ν) and wall-clock time T(z, p, ν) in a round can be expressed by

E(z, p, ν) := Σ_{i∈N} ( E_i^cp(ν_i) + E_i^co(z_i, p_i) )   (30)
T(z, p, ν) := max_{i∈N} T_i^cp(ν_i) + max_{i∈N} Σ_{m∈M} z_{i,m} T_i^co(z_i, p_i)   (31)

where we neglect the communication time for transmitting the scalar u_i. The total contribution to the convergence is

U(z) = Σ_{i∈N} Σ_{m∈M} z_{i,m} u_i   (32)

where u_i is given in (21)³.

³ One can regularize u_i by adding a large enough constant in (21) to keep it positive.

We consider the following non-convex mixed-integer non-linear programming (MINLP) problem:

(P)  max_{z, p, ν}  U(z) − η_1 E(z, p, ν) − η_2 T(z, p, ν)
s.t. 0 ≤ p_i ≤ p_i^max, ∀i ∈ N   (33)
     0 ≤ ν_i ≤ ν_i^max, ∀i ∈ N   (34)
     z_{i,m} ∈ {0, 1}, ∀i ∈ N, m ∈ M   (35)
     constraints (26) and (27)

where η_1 ≥ 0 and η_2 ≥ 0 are weight parameters that capture the Pareto-optimal trade-offs among convergence, latency, and energy consumption; their values depend on the specific scenario. Constraints (33) and (34) give the feasible regions of the devices' transmission power levels and CPU-cycle frequencies, respectively. Constraints (26) and (27) enforce that each device can access at most one uplink RB while each RB can be allocated to at most one device.

In this formulation, we aim to maximize the convergence speed of FML while minimizing the energy consumption and wall-clock time in each round. Notably, our solution can adapt to the problem with hard constraints on energy consumption and wall-clock time as in [14] via setting "virtual devices" (see Lemma 4 and Lemma 6).

Next, we provide a joint device selection and resource allocation algorithm to solve this problem.

C. A Joint Device Selection and Resource Allocation Algorithm

Substituting (30), (31), and (32) into problem (P), we can easily decompose the original problem into the following two sub-problems (SP1) and (SP2).

(SP1)  min_ν  g_1(ν) = η_1 Σ_{i∈N} (ι_i/2) c_i D_i ν_i² + η_2 max_{i∈N} c_i D_i / ν_i
       s.t. 0 ≤ ν_i ≤ ν_i^max, ∀i ∈ N.

(SP2)  max_{z, p}  g_2(z, p) = Σ_{i∈N} Σ_{m∈M} z_{i,m} u_i
                    − η_1 Σ_{i∈N} Σ_{m∈M} z_{i,m} S p_i / (B log₂(1 + h_i p_i/(I_m + B N_0)))
                    − η_2 max_{i∈N} Σ_{m∈M} z_{i,m} S / (B log₂(1 + h_i p_i/(I_m + B N_0)))
       s.t. 0 ≤ p_i ≤ p_i^max, ∀i ∈ N; (26); (27); (35).

(SP1) controls the CPU-cycle frequencies of the devices to minimize the energy consumption and latency in the computation phase. (SP2) controls the transmission power and RB allocation to maximize the convergence speed while minimizing the transmission cost and communication delay. We provide the solutions to these two sub-problems separately.
1) Solution to (SP1): Denote the optimal CPU-cycle frequencies of (SP1) as ν* = {ν_i*}_{i∈N}. We first give the following lemma to offer insights into this sub-problem.

Lemma 4. If device j is the straggler among all devices under the optimal CPU frequencies ν*, i.e., j = arg max_{i∈N} c_i D_i / ν_i*, the following holds true for any η_1, η_2 > 0:

ν_i* = { min{ ∛(a_2/(2a_1)), min_{i∈N} (c_j D_j / (c_i D_i)) ν_i^max },  if i = j;
         (c_i D_i / (c_j D_j)) ν_j*,                                    otherwise. }   (36)

The positive constants a_1 and a_2 in (36) are defined as

a_1 := η_1 Σ_{i∈N} ι_i (c_i D_i)³ / (2 (c_j D_j)²)   (37)
a_2 := η_2 c_j D_j.   (38)

Sketch of Proof. The derivation involves two steps, i.e., expressing ν_i* in terms of ν_j* and deriving ν_j* by solving the corresponding optimization problem. The detailed proof is presented in Appendix D of the technical report [39].

Lemma 4 implies that if the straggler (the device with the longest computation time) can be determined, then the optimal CPU-cycle frequencies of all devices can be derived in closed form. Intuitively, due to the conflict between minimizing the energy consumption and the computation time, once the straggler is fixed, the other devices can use the smallest CPU-cycle frequencies such that their computation times do not exceed that of the straggler. This leads to the following theorem.

Theorem 2. Denote by ν*_{straggler:j} the optimal solution (i.e., (36)) under the assumption that j is the straggler. Then, the globally optimal solution of (SP1) can be obtained by

ν* = arg min_{ν∈V} g_1(ν)   (39)

where V := {ν*_{straggler:j}}_{j∈N} and g_1 is the objective function in (SP1).

Proof. The result follows directly from Lemma 4 and is omitted for brevity.

Theorem 2 shows that the optimal solution of (SP1) is the fixed-straggler solution of Lemma 4 that attains the minimum objective g_1. Thus, (SP1) can be solved with computational complexity O(n) by comparing the achievable objective values corresponding to different stragglers.

2) Solution to (SP2): Similar to Section V-C1, we denote the optimal solutions of (SP2) as z* = {z*_{i,m} | i ∈ N, m ∈ M} and p* = {p_i* | i ∈ N} respectively, N* := {i ∈ N | Σ_{m∈M} z*_{i,m} = 1} as the optimal set of selected devices, and, for each i ∈ N*, m_i* as the RB allocated to i, i.e., z*_{i,m_i*} = 1.

It is challenging to derive a closed-form solution of (SP2) because it is a non-convex MINLP problem with a non-differentiable "max" operator in the objective. Thus, in the following, we develop an iterative algorithm to solve this problem and show that the algorithm converges to a local minimum.

We begin by analyzing the properties of the optimal solution in the next lemma.

Lemma 5. Denote the transmission delay under z* and p* as δ*, i.e.,

δ* := max_{i∈N} Σ_{m∈M} z*_{i,m} S / (B log₂(1 + h_i p_i*/(I_m + B N_0))).   (40)

The following relation holds for each i ∈ N*:

δ* ≥ S / (B log₂(1 + h_i p_i^max/(I_{m_i*} + B N_0)))   (41)

and p_i* can be expressed by

p_i* = { (I_{m_i*} + B N_0)(2^{S/(B δ*)} − 1) / h_i,  if i ∈ N*;
         0,                                           otherwise. }   (42)

Sketch of Proof. We prove (41) by contradiction and (42) by solving the corresponding transformed problem. The detailed proof is presented in Appendix E of the technical report [39].

Lemma 5 indicates that the optimal transmission power can be derived in closed form via (42), given the RB allocation and the transmission delay. Lemma 5 also implies that for any RB allocation strategy z̃ and transmission delay δ̃ (not necessarily optimal), equation (42) provides the "optimal" transmission power under z̃ and δ̃ as long as (41) is satisfied. Based on that, we have the following result.

Theorem 3. Denote μ_{i,m} := (I_m + B N_0)(2^{S/(B δ*)} − 1) / h_i. Given the transmission delay δ*, the optimal RB allocation strategy can be obtained by

z* = arg max_z Σ_{i,m} z_{i,m} (u_i − e_{i,m}),  s.t. (26)–(35)   (43)

where

e_{i,m} := { η_1 δ* μ_{i,m},  if μ_{i,m} ≤ p_i^max;
             u_i + 1,          otherwise. }   (44)

Proof. From Lemma 5, Eq. (43) holds if μ_{i,m} ≤ p_i^max. On the other hand, when μ_{i,m} > p_i^max, if device i is selected, the transmission delay will be larger than δ* (see the proof of Lemma 5), which contradicts the given condition. Thus, when μ_{i,m} > p_i^max, we set e_{i,m} = u_i + 1, ensuring that device i is not selected.

Theorem 3 shows that the optimal RB allocation strategy can be obtained by solving (43), given the transmission delay δ*. Naturally, problem (43) can be equivalently transformed into a bipartite matching problem. Consider a bipartite graph G with source set N and destination set M. For each i ∈ N and m ∈ M, denote the weight of the edge from node i to node m as w_{i→m}: if u_i − e_{i,m} > 0, then w_{i→m} = e_{i,m} − u_i; otherwise, w_{i→m} = ∞. Therefore, maximizing (43) is equivalent to finding a matching in G with the minimum sum of weights. It means that we can obtain the optimal RB allocation strategy under a fixed transmission delay via the Kuhn-Munkres algorithm with worst-case complexity O(Mn²) [42].
We proceed to show how to iteratively approximate the optimal δ*, p*, and z*.

Lemma 6. Let j denote the communication straggler among all selected devices with respect to the RB allocation z* and transmission power p*, i.e., for any i ∈ N*,

T_i^co(z_i*, p_i*) = Σ_{m∈M} z*_{i,m} S / (B log₂(1 + h_i p_i*/(I_m + B N_0)))
                   ≤ Σ_{m∈M} z*_{j,m} S / (B log₂(1 + h_j p_j*/(I_m + B N_0))) = T_j^co(z_j*, p_j*).   (45)

Then, the following holds true:
1) Define the function f_4(p) := b_1 ((1 + p) log₂(1 + p) ln 2 − p) − η_2. Then f_4(p) is monotonically increasing with respect to p ≥ 0 and has a unique zero point p̃_j^0 ∈ (0, b_2], where b_1 and b_2 are denoted by

b_1 := η_1 Σ_{i∈N*} (I_{m_i*} + B N_0) / h_i   (46)
b_2 := 2^{(√(1 + max{η_2/b_1, 1}) − 1)/ln 2};   (47)

2) Denote (SNR)_i := (I_{m_i*} + B N_0) / h_i. For i ∈ N*, we have

p_i* = { min{ p̃_j^0, min_{i∈N*} h_i p_i^max / (I_{m_i*} + B N_0) },  if i = j;
         ((SNR)_i / (SNR)_j) p_j*,                                    otherwise. }   (48)

Sketch of Proof. We obtain the first result by analyzing the properties of f_4 and derive (48) by solving the corresponding optimization problem. The detailed proof is presented in Appendix F of the technical report [39].

Lemma 6 indicates that, given the optimal RB allocation strategy z* and the straggler, the optimal transmission power can be derived by (48), in contrast to Lemma 5 which requires the corresponding transmission delay δ*. Notably, in (48), we can obtain the zero point p̃_j^0 of f_4 with any required tolerance ε by the bisection method in at most log₂(b_2/ε) iterations.

Similar to Theorem 2, we can find the optimal transmission power via the following theorem, given the RB allocation.

Theorem 4. Denote by p*_{straggler:j} the optimal solution under the assumption that j is the communication straggler, given the fixed RB allocation z*. The corresponding optimal transmission power is given by

p* = arg max_{p∈P} g_2(z*, p)   (49)

where P := {p*_{straggler:j}}_{j∈N} and g_2 is the objective function defined in (SP2).

Proof. The result follows easily from Lemma 6 and is omitted for brevity.

Define the communication time corresponding to z and p as T^co(z, p) := max_{i∈N} Σ_{m∈M} z_{i,m} T_i^co(z_i, p_i). Based on Theorems 3 and 4, we have the following Iterative Solution (IVES) algorithm to solve (SP2).

IVES: We initialize the transmission delay δ_0 (based on (42)) as follows:

δ_0 = max_{i∈N} S / (B log₂(1 + h_i p_i^max/(I_m + B N_0))).   (50)

In each iteration t, we first compute an RB allocation strategy z_t via solving (43) by the Kuhn-Munkres algorithm. Then, based on z_t, we find the corresponding transmission power p_t by (49) and update the transmission delay by δ_{t+1} = T^co(z_t, p_t) before the next iteration. The details of IVES are depicted in Algorithm 2.

Algorithm 2: Iterative Solution (IVES)
Input: η_1, η_2, S, B, {h_i}, {I_m}
1  Initialize t = 0 and δ_0 by (50);
2  while not done do
3      Compute the RB allocation strategy z_t under δ_t using the Kuhn-Munkres algorithm based on (43);
4      Compute the transmission power p_t by (49);
5      Update δ_{t+1} = T^co(z_t, p_t);
6  end
7  return z*, p*

Using IVES, we can solve (SP2) in an iterative manner. In the following theorem, we provide the convergence guarantee for IVES.

Theorem 5. If we solve (SP2) by IVES, then {g_2(z_t, p_t)} monotonically increases and converges to a unique point.

Sketch of Proof. The result is derived by proving g(z_t, p_t) ≤ g(z_{t+1}, p̂_{t+1}) and g(z_{t+1}, p̂_{t+1}) ≤ g(z_{t+1}, p_{t+1}). The detailed proof is presented in Appendix H of the technical report [39].

Although IVES solves (SP2) iteratively, we observe in the simulations that it converges extremely fast (often within only two iterations), achieving a low computational complexity.

Combining the solutions of (SP1) and (SP2), we provide the User selection and Resource Allocation (URAL) algorithm in Algorithm 3 to solve the original problem (P). URAL can simultaneously optimize the convergence speed, training time, and energy consumption via jointly selecting devices and allocating resources. Further, URAL can be directly integrated into the NUFM paradigm in the device selection phase to facilitate the deployment of FML in wireless networks.

Algorithm 3: User Selection and Resource Allocation (URAL) Algorithm
Input: η_1, η_2, S, B, {h_i}, {I_m}, {ι_i}, {c_i}, {D_i}
1  Compute ν* by (39);           // Solve (SP1)
2  Compute z* and p* by IVES;    // Solve (SP2)
3  return ν*, z*, p*
VI. EXTENSION TO FIRST-ORDER APPROXIMATIONS

Due to the computation of the Hessian in the local update (2), NUFM may cause a high computational cost for resource-limited IoT devices. In this section, we address this challenge.

There are two common methods used in the literature to reduce the complexity of computing the Hessian [9]:
1) replacing the stochastic gradient by

∇̃F_i(θ) ≈ ∇̃f_i(θ − α ∇̃f_i(θ, D_i), D_i');   (51)

2) replacing the Hessian-gradient product by

∇̃²f_i(θ, D_i'') ∇̃f_i(θ − α ∇̃f_i(θ, D_i), D_i') ≈ ( ∇̃f_i(θ + ε g̃_i, D_i'') − ∇̃f_i(θ − ε g̃_i, D_i'') ) / (2ε)   (52)

where g̃_i = ∇̃f_i(θ − α ∇̃f_i(θ, D_i), D_i').

By doing so, the computational complexity of a one-step local update can be reduced from O(d²) to O(d) without sacrificing too much learning performance. Next, we show that our results in Theorem 1 hold in the above two cases.

Corollary 2. Suppose that Assumptions 1–4 are satisfied, and D_i, D_i', and D_i'' are independent batches. If the local update and global aggregation follow (2) and (6) respectively, we have for τ = 1

E[F(θ^k) − F(θ^{k+1})] ≥ β E[ (1/n_k) Σ_{i∈N_k} ( (1 − L_F β/2) ‖∇̃F_i(θ^k)‖² − ((1 + αL)² γ_G + α ζ γ_H + σ̃_{F_i}) √(E[‖∇̃F_i(θ^k)‖² | N_k]) ) ]   (53)

where σ̃_{F_i} is defined as follows:

σ̃²_{F_i} = { 4 σ_G² (1/D_i' + (αL)²/D_i) + 2 (α L ζ)²,                                   if using (51);
             6 σ_G² ( α²σ_H²/(ε² D_i'') + 1 + 2(αL)² ) (1/D_i' + (αL)²/D_i) + 2 (α ρ ε)² ζ⁴,  if using (52). }   (54)

Proof. The detailed proof is presented in Appendix I of the technical report [39].

Corollary 2 indicates that NUFM can be directly combined with the first-order approximation techniques to reduce the computational cost. Further, similar to Corollary 1, Corollary 2 can be extended to the multi-step case.

VII. SIMULATION

This section evaluates the performance of our proposed algorithms by comparing them with existing baselines on real-world datasets. We first present the experimental setup, including the datasets, models, parameters, baselines, and environment. Then we provide our results from various aspects.

A. Experimental Setup

1) Datasets and Models: We evaluate our algorithms on four widely-used benchmarks, namely Fashion-MNIST [43], CIFAR-10 [44], CIFAR-100 [44], and ImageNet [45]. Specifically, the data is distributed among n = 100 devices as follows: a) each device has samples from two random classes; b) the number of samples per class follows a truncated Gaussian distribution N(μ, σ²) with μ = 5 and σ = 5. We select 50% of the devices at random for training, with the rest for testing. For each device, we divide the local dataset into a support set and a query set. We consider 1-shot 2-class classification tasks, i.e., the support set contains only 1 labeled example for each class. We set the stepsizes as α = β = 0.001. We use a convolutional neural network (CNN) with max-pooling operations and the Leaky Rectified Linear Unit (Leaky ReLU) activation function, containing three convolutional layers with sizes 32, 64, and 128 respectively, followed by a fully connected layer and a softmax layer. The strides are set to 1 for the convolution operation and 2 for the pooling operation.

2) Baselines: To compare the performance of NUFM, we first consider two existing algorithms, i.e., FedAvg [4] and Per-FedAvg [9]. Further, to validate the effectiveness of URAL in multi-access wireless networks, we use two baselines, called Greedy and Random. In each round, the Greedy strategy determines the CPU-cycle frequency ν_i^g and transmission power p_i^g for device i ∈ N by greedily minimizing its individual objective, i.e.,

ν_i^g = arg min_{ν_i} { η_1 ι_i c_i D_i ν_i²/2 + η_2 c_i D_i/ν_i },  s.t. (34)   (55)
p_i^g = arg min_{p_i} { Σ_{m∈M} z_{i,m}^g (η_1 p_i + η_2) S / (B log₂(1 + h_i p_i/(I_m + B N_0))) },  s.t. (33)   (56)

where {z_{i,m}^g}_{m∈M} is selected at random (i.e., RBs are randomly allocated to the selected devices). The Random strategy decides the CPU-cycle frequencies, transmission powers, and RB allocation for the selected devices uniformly at random from the feasible regions.

3) Implementation: We implement the code in TensorFlow Version 1.14 on a server with two Intel Xeon Golden 5120 CPUs and one Nvidia Tesla-V100 32G GPU. The parameters used in the simulation can be found in Table III.

B. Experimental Results

1) Convergence Speed: To demonstrate the improvement of NUFM in convergence speed, we compare the algorithms on different benchmarks with the same initial model and learning rate. We vary the number of participating devices n_k from 20 to 40, and set the number of local update steps and total communication rounds to τ = 1 and K = 50, respectively. We let λ_1 = λ_2 = 1. As illustrated in Fig. 2 and Table II, NUFM significantly improves the convergence speed and corresponding test accuracy of the existing FML approaches on all datasets⁴. Clearly, it validates the effectiveness of our proposed device selection scheme that maximizes the lower bound of the one-round global loss reduction.

⁴ To make the graphs more legible, we draw symbols every two points in Fig. 2.
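For completeness, the non-IID partition described in Section VII-A1 (two random classes per device, per-class sample counts drawn from a truncated Gaussian N(5, 5²)) can be generated as in the following sketch; the helper names and the clipping details are our assumptions.

```python
import numpy as np

def partition(labels, n_devices=100, classes_per_device=2, mu=5, sigma=5, seed=0):
    """Sketch of the non-IID split: each device draws two random classes, and the
       number of samples per class follows N(mu, sigma^2) truncated to at least one."""
    rng = np.random.default_rng(seed)
    by_class = {c: list(rng.permutation(np.where(labels == c)[0])) for c in np.unique(labels)}
    devices = []
    for _ in range(n_devices):
        classes = rng.choice(list(by_class), size=classes_per_device, replace=False)
        idx = []
        for c in classes:
            k = int(np.clip(round(rng.normal(mu, sigma)), 1, len(by_class[c])))
            idx.extend(by_class[c][:k])
            by_class[c] = by_class[c][k:]      # remove the assigned samples
        devices.append(np.array(idx))
    return devices

# Example with synthetic labels for 10 classes (assumed sizes).
labels = np.repeat(np.arange(10), 600)
parts = partition(labels, n_devices=20)
print([len(p) for p in parts[:5]])
```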
Fig. 2. Comparison of convergence rates under different numbers of participating devices (Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet; training loss versus round). NUFM significantly accelerates the convergence of the existing FML approach, especially with fewer participating devices. In addition, with more participating devices, the advantages of NUFM weaken, leading to smaller gaps between NUFM and the existing algorithms.

Interestingly, Fig. 2 also indicates that NUFM converges more quickly with relatively fewer participating devices. For example, in round 19, the loss achieved by NUFM with n_k = 20 decreases by more than 9% and 20% over those with n_k = 30 and n_k = 40 on Fashion-MNIST, respectively. The underlying rationale is that relatively fewer "good" devices can provide a larger lower bound on the one-round global loss decrease (note that in (17) the lower bound takes the average over the selected devices). More selected devices in each round generally require more communication resources. Thus, the results reveal the potential of NUFM for resource-limited wireless systems.

TABLE II: TEST ACCURACY AFTER FIFTY ROUNDS OF TRAINING
Algorithm  | Fashion-MNIST | CIFAR-10 | CIFAR-100 | ImageNet
NUFM       | 68.04%        | 58.80%   | 23.95%    | 34.04%
Per-FedAvg | 62.75%        | 58.22%   | 21.49%    | 30.98%
FedAvg     | 61.04%        | 54.31%   | 10.13%    | 12.14%

2) Effect of Local Update Steps: To show the effect of the number of local update steps on the convergence rate of NUFM, we present results with varying numbers of local update steps τ = 1, 2, ..., 10 in each round. For clarity of illustration, we compare the loss under different numbers of local steps in round 19 on Fashion-MNIST and CIFAR-100. Fig. 3 shows that fewer local update steps lead to a larger gap between the baselines and NUFM, which verifies the theoretical result that a small number of local steps can slow the convergence of FedAvg and Per-FedAvg [9, Theorem 4.5]. It also implies that NUFM can improve the computational efficiency of local devices.

Fig. 3. Effect of local update steps on convergence rates (training loss versus number of local steps on Fashion-MNIST and CIFAR-100). Fewer local steps lead to larger gaps between NUFM and the existing methods.

3) Performance of URAL in Wireless Networks: We evaluate the performance of URAL by comparing it with four baselines, namely NUFM-Greedy, NUFM-Random, RU-Greedy, and RU-Random, as detailed below.
a) NUFM-Greedy: select devices by NUFM; decide CPU-cycle frequencies, RB allocation, and transmission power by the Greedy strategy;
b) NUFM-Random: select devices by NUFM; decide CPU-cycle frequencies, RB allocation, and transmission power by the Random strategy;
c) RU-Greedy: select devices uniformly at random; decide CPU-cycle frequencies, RB allocation, and transmission power by the Greedy strategy;
d) RU-Random: select devices uniformly at random; decide CPU-cycle frequencies, RB allocation, and transmission power by the Random strategy.

We simulate a wireless system consisting of M = 20 RBs and let the channel gain h_i of device i follow a uniform distribution U(h_min, h_max) with h_min = 0.1 and h_max = 1. We set S = 1, B = 1, and N_0 = 1. The interference of RB m is drawn from I_m ∼ U(0, 0.8). We set ι_i ∼ U(0, 1), c_i ∼ U(0, 0.25), p_i^max ∼ U(0, 1), and ν_i^max ∼ U(0, 2) for each i ∈ N to simulate device heterogeneity. In the following experiments, we run the algorithms on Fashion-MNIST with local update steps τ = 1 and η_1 = η_2 = 1.

As shown in Fig. 4, URAL can significantly reduce the energy consumption and wall-clock time, as compared with the baselines. However, it is counter-intuitive that the Greedy strategy is not always better than Random. There are two reasons. On one hand, the energy cost and wall-clock time depend on the selection of the weight parameters η_1 and η_2. The results in Fig. 4 imply that, when η_1 = η_2 = 1, Greedy pays more attention to the wall-clock time than to the energy consumption. Accordingly, Greedy achieves a much lower average delay than Random, but sacrifices part of the energy. On the other hand, the wall-clock time and energy cost require joint control with the RB allocation. Although Greedy minimizes the individual objectives (55)-(56), improper RB allocation can cause arbitrary performance degradation. Different from the Greedy- and Random-based baselines, since URAL maximizes the joint objective (P) via co-optimizing the CPU-cycle frequencies, transmission power, and RB allocation strategy, it can alleviate the above-mentioned issues and achieve better delay and energy control. At the same time, Fig. 4 indicates that URAL converges as fast as NUFM-Greedy and NUFM-Random (the corresponding lines almost overlap), which select devices greedily to accelerate convergence (as in NUFM). Thus, URAL achieves an excellent convergence rate.

Fig. 4. Comparison of convergence, energy cost, and wall-clock training time. URAL can achieve a fast convergence speed and short wall-clock time with low energy consumption.

4) Effect of Resource Blocks: In Fig. 5, we test the performance of URAL under different numbers of RBs. We vary the number of RBs M from 1 to 50. More RBs enable more devices to be selected in each round, leading to larger energy consumption. As shown in Fig. 5, URAL keeps the wall-clock time stable as the number of RBs increases. Meanwhile, URAL can control the power of the devices to avoid a serious waste of energy. It is counter-intuitive that the convergence speed does not always increase with the number of RBs, especially for URAL. The reason is indeed the same as that in Section VII-B1. That is, too few selected devices can slow the convergence due to insufficient information provided in each round, while a large number of participating devices may weaken the global loss reduction as shown in (17). Therefore, URAL can adapt to practical systems with constrained wireless resources via achieving fast convergence with only a small set of devices.

Fig. 5. Comparison of convergence, energy cost, and wall-clock time under different numbers of RBs. URAL can well control the energy cost and wall-clock time with more available RBs. Meanwhile, it can achieve fast convergence with only a small number of RBs.

5) Effect of Channel Quality: To investigate the effect of channel conditions on performance, we set the number of RBs M = 20 and vary the maximum channel gain h_max from 0.25 to 2, and show the corresponding energy consumption and wall-clock training time in Fig. 6. The results indicate that the energy consumption and latency decrease as channel quality improves, because devices can use less power to achieve a relatively large transmission rate.

6) Effect of Weight Parameters: We study how the weight parameters η_1 and η_2 affect the average energy consumption and wall-clock time of URAL in Fig. 7. We first fix η_2 = 1 and vary η_1 from 0.5 to 2.5. As expected, the total energy consumption decreases with the increase of η_1, with the opposite trend for the wall-clock time. Then we vary η_2 with η_1 = 1. Similarly, a larger η_2 leads to less latency and more energy cost. It implies that we can control the levels of wall-clock training time and energy consumption by tuning the weight parameters. In particular, even with a large η_1 or η_2, the wall-clock time and energy cost can be kept at low levels. Meanwhile, the convergence rate achieved by URAL is robust to η_1 and η_2. Thus, URAL can make full use of the resources (including datasets, bandwidth, and power) and achieve good trade-offs among the convergence rate, latency, and energy consumption.
13

TABLE III
PARAMETERS IN S IMULATION
Parameter Value
Step size (𝛼) and meta–learning rate (𝛽) 0.001
# edge devices (𝑛) 100
# participating devices (𝑛𝑘 ) in Experiments 1 and 2 An integer varying among {20, 30, 40}
# local updates (𝜏) An integer varying among {1, 2, . . . , 10}
Hyper-parameters (𝜆1 and 𝜆2 ) 1
Weight parameters (𝜂1 and 𝜂2 ) Real numbers varying among {0.5, 1, 1.5, 2.0, 2.5}
# RBs An integer varying among {1, 5, 10, 20, 30, 40, 50}
CPU cycles of devices(𝑐𝑖 ) Real numbers following 𝑈 (0, 0.25)
Maximum CPU-cycle frequencies of devices (𝜈𝑖max ) Real numbers following 𝑈 (0, 2)
Maximum transmission powers of devices ( 𝑝𝑖max ) Real numbers following 𝑈 (0, 1)
Inference in RBs (𝐼𝑚 ) Real numbers following 𝑈 (0, 0.8)
Model size (𝑆), Bandwidth (𝐵), 1
Noise power spectral density (𝑁0 ) 1
effective capacitance coefficients of devices (𝜄𝑖 ) Real numbers following 𝑈 (0, 1)
Real numbers following 𝑈 (0.1, ℎ max )
Channel gains of devices
where ℎ max varies among {0.25, 0.5, 0.75, . . . , 2.0}

R U -R a n d o m R U -G re e d y N U F M -R a n d o m R U -R a n d o m R U -G re e d y N U F M -R a n d o m
N U F M -G re e d y U R A L N U F M -G re e d y U R A L

6 0
5 5 0 5 5 0 4 0
5 0

E n e r g y C o n s u m p tio n
3 5 0
E n e r g y C o n s u m p tio n

3 5 0
W a ll- C lo c k T im e

1 8 0
W a ll- C lo c k T im e

4 0 1 8 0 3 0
1 6 0
1 6 0
3 0 2 5
2 0
3 0
2 0 1 5
1 5
1 0
1 0 0 0 .5 1 .0 1 .5 2 .0 2 .5 0 .5 1 .0 1 .5 2 .0 2 .5
0 .5 1 .0 1 .5 2 .0 0 .5 1 .0 1 .5 2 .0
C h a n n e l G a in C h a n n e l G a in
5 5 0 4 0

E n e r g y C o n s u m p tio n
Fig. 6. Effect of channel gains on performance. Worse channel conditions 3 5 0
W a ll- C lo c k T im e

1 8 0
3 0
would induce larger transmission power and longer wall-clock time.
1 6 0
2 5
2 0
achieve great trade-offs among the convergence rate, latency,
1 5
1 0
and energy consumption.
0 .5 1 .0 1 .5 2 .0 2 .5 0 .5 1 .0 1 .5 2 .0 2 .5

VIII. C ONCLUSION Fig. 7. Effect of weight parameters 𝜂1 and 𝜂2 . A large 𝜂1 achieves lower
energy consumption while leading to longer wall-clock time (the average wall-
In this paper, we have proposed an FML algorithm, called clock time is 18.63 when 𝜂1 = 0.5; it is 21.53 when 𝜂1 = 2.5). It is the
NUFM, that maximizes the theoretical lower bound of global opposite for 𝜂2 .
loss reduction in each round to accelerate the convergence.
Aiming at effectively deploying NUFM in wireless networks,
R EFERENCES
we present a device selection and resource allocation strategy
(URAL), which jointly controls the CPU-cycle frequencies and [1] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network
RB allocation to optimize the trade-off between energy con- intelligence at the edge,” Proc. IEEE, vol. 107, no. 11, pp. 2204–2239,
2019.
sumption and wall-clock training time. Moreover, we integrate [2] G. Plastiras, M. Terzi, C. Kyrkou, and T. Theocharidcs, “Edge intel-
the proposed algorithms with two first-order approximation ligence: Challenges and opportunities of near-sensor machine learning
techniques to further reduce the computational complexity in applications,” in Proc. IEEE ASAP, 2018, pp. 1–7.
[3] X. Zhang, Y. Wang, S. Lu, L. Liu, W. Shi et al., “Openei: An open
IoT devices. Extensive simulation results demonstrate that the framework for edge intelligence,” in Proc. IEEE ICDCS, 2019, pp.
proposed methods outperform the baseline algorithms. 1840–1851.
Future work will investigate the trade-off between the local [4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,
“Communication-efficient learning of deep networks from decentralized
update and global aggregation in FML to minimize the conver- data,” in Proc. AISTATS. PMLR, 2017, pp. 1273–1282.
gence time and energy cost from a long-term perspective. In [5] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for
addition, how to characterize the convergence properties and fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
[6] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, “Federated meta-learning
communication complexity of the NUFM algorithm requires with fast convergence and efficient communication,” ArXiv reprints
further research. arXiv: 1802.07876, 2019.
14

[7] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, “Improving feder- [32] W. Luping, W. Wei, and L. Bo, “Cmfl: Mitigating communication
ated learning personalization via model agnostic meta learning,” ArXiv overhead for federated learning,” in Proc. IEEE ICDCS, 2019, pp. 954–
reprints arXiv: 1909.12488, 2019. 964.
[8] S. Lin, G. Yang, and J. Zhang, “A collaborative learning framework via [33] J. Li, M. Khodak, S. Caldas, and A. Talwalkar, “Differentially private
federated meta-learning,” in Proc. IEEE ICDCS, 2020, pp. 289–299. meta-learning,” in Proc. ICLR, 2019.
[9] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated [34] K. C. Sim et al., “Personalization of end-to-end speech recognition on
learning with theoretical guarantees: A model-agnostic meta-learning mobile devices for named entities,” in Proc. IEEE ASRU, 2019, pp.
approach,” in Proc. NIPS, 2020, pp. 1–12. 23–30.
[10] P. Kairouz et al., “Advances and open problems in federated learning,” [35] W. Shi, S. Zhou, and Z. Niu, “Device scheduling with fast convergence
ArXiv reprints arXiv: 1912.04977, 2021. for wireless federated learning,” in Proc. IEEE ICC, 2020, pp. 1–6.
[11] S. Yue, J. Ren, J. Xin, S. Lin, and J. Zhang, “Inexact-admm based [36] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy
federated meta-learning for fast and continual edge learning,” in Proc. efficient federated learning over wireless communication networks,”
ACM MobiHoc, 2021, p. 91–100. IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, 2020.
[12] T. Nishio and R. Yonetani, “Client selection for federated learning with [37] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, “Fedpd: A federated
heterogeneous resources in mobile edge,” in Proc. IEEE ICC, 2019, pp. learning framework with optimal rates and adaptivity to non-iid data,”
1–7. ArXiv reprints arXiv: 2005.11418, 2020.
[13] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang, [38] A. Fallah, A. Mokhtari, and A. Ozdaglar, “On the convergence theory
and H. V. Poor, “Fast-convergent federated learning,” IEEE J. Sel. Areas of gradient-based model-agnostic meta-learning algorithms,” in Proc.
Commun., vol. 39, no. 1, pp. 201–218, 2020. AISTATS, 2020, pp. 1082–1092.
[14] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint [39] S. Yue, J. Ren, J. Xin, D. Zhang, Y. Zhang, and W. Zhuang, “Efficient
learning and communications framework for federated learning over federated meta-learning over multi-access wireless networks,” arXiv
wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. preprint arXiv:2108.06453, 2021.
269–283, 2020. [40] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction
[15] C. T. Dinh, N. H. Tran, M. N. Nguyen, C. S. Hong, W. Bao, A. Y. to algorithms. MIT press, 2009.
Zomaya, and V. Gramoli, “Federated learning over wireless networks: [41] T. D. Burd and R. W. Brodersen, “Processor design for portable
Convergence analysis and resource allocation,” IEEE/ACM Trans. Net- systems,” J. VLSI Sig. Proc. Syst. Sig. Image Video Technol., vol. 13,
working, vol. 29, no. 1, pp. 398–409, 2020. no. 2, pp. 203–221, 1996.
[16] J. Liu, J. Ren, Y. Zhang, X. Peng, Y. Zhang, and Y. Yang, “Efficient de- [42] E. W. Weisstein, “Hungarian maximum matching algorithm,”
pendent task offloading for multiple applications in mec-cloud system,” https://mathworld.wolfram.com/, 2011.
IEEE Trans. Mob. Comput., 2021. [43] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image
[17] Z. M. Fadlullah and N. Kato, “Hcp: Heterogeneous computing platform dataset for benchmarking machine learning algorithms,” ArXiv reprints
for federated learning based collaborative content caching towards 6g arXiv: 1708.07747, 2017.
networks,” IEEE Trans. Emerging Top. Comput., 2020. [44] A. Krizhevsky, “Learning multiple layers of features from tiny images,”
[18] Q. Wu, K. He, and X. Chen, “Personalized federated learning for Technical Report TR-2009, University of Toronto, 2009.
intelligent iot applications: A cloud-edge based framework,” IEEE Open [45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
J. Comput. Soc., vol. 1, pp. 35–44, 2020. A large-scale hierarchical image database,” in Proc. of IEEE CVPR,
[19] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: 2009, pp. 248–255.
Concept and applications,” ACM Trans. Intell. Syst. Technol., vol. 10,
no. 2, pp. 1–19, 2019.
[20] S. Yue, J. Ren, N. Qiao, Y. Zhang, H. Jiang, Y. Zhang, and Y. Yang,
“Todg: Distributed task offloading with delay guarantees for edge
computing,” IEEE Trans. Parallel Distrib. Syst., 2021.
[21] J. Ren, Y. He, D. Wen, G. Yu, K. Huang, and D. Guo, “Scheduling
for cellular federated edge learning with importance and channel aware-
ness,” IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7690–7703,
2020.
[22] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for
low-latency federated edge learning,” IEEE Trans. Wireless Commun.,
vol. 19, no. 1, pp. 491–506, 2019.
[23] Q. Zeng, Y. Du, K. Huang, and K. K. Leung, “Energy-efficient radio
resource allocation for federated edge learning,” in Proc. IEEE ICC
Workshops, 2020, pp. 1–6.
[24] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Joint device schedul-
ing and resource allocation for latency constrained wireless federated
learning,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 453–467,
2020.
[25] M. Chen, H. V. Poor, W. Saad, and S. Cui, “Convergence time opti-
mization for federated learning over wireless networks,” IEEE Trans.
Wireless Commun., vol. 20, no. 4, pp. 2457–2471, 2020.
[26] J. Ren, G. Yu, and G. Ding, “Accelerating dnn training in wireless
federated edge learning systems,” IEEE J. Sel. Areas Commun., vol. 39,
no. 1, pp. 219–232, 2020.
[27] H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communi-
cation efficient momentum sgd for distributed non-convex optimization,”
in Proc. ICML, 2019, pp. 7184–7193.
[28] J. Xu and H. Wang, “Client selection and bandwidth allocation in
wireless federated learning networks: A long-term perspective,” IEEE
Trans. Wireless Commun., vol. 20, no. 2, pp. 1188–1200, 2020.
[29] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and
K. Chan, “Adaptive federated learning in resource constrained edge
computing systems,” IEEE J. Sel. Areas in Commun., vol. 37, no. 6,
pp. 1205–1221, 2019.
[30] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T.
Suresh, “Scaffold: Stochastic controlled averaging for federated learn-
ing,” in Proc. ICML, 2020, pp. 5132–5143.
[31] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, “Cost-effective
federated learning design,” in Proc. IEEE INFOCOM, 2021.
15

A PPENDIX A
P ROOF OF L EMMA 1
The proof is standard [9], [11], and we provide it for completeness. Recalling the definition of 𝐹𝑖 as 𝐹𝑖 (𝜃) B 𝑓𝑖 (𝜃 −𝛼∇ 𝑓𝑖 (𝜃)),
we have
∇𝐹𝑖 (𝜃) = (𝐼 − 𝛼∇2 𝑓𝑖 (𝜃))∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)).
Based on that, for any 𝜃 1 , 𝜃 2 ∈ R𝑑 we have
k∇𝐹𝑖 (𝜃 1 ) − ∇𝐹𝑖 (𝜃 2 ) k = k (𝐼 − 𝛼∇2 𝑓𝑖 (𝜃 1 ))∇ 𝑓𝑖 (𝜃 1 − 𝛼∇ 𝑓𝑖 (𝜃 1 )) − (𝐼 − 𝛼∇2 𝑓𝑖 (𝜃 2 ))∇ 𝑓𝑖 (𝜃 2 − 𝛼∇ 𝑓𝑖 (𝜃 2 )) k
= k (𝐼 − 𝛼∇2 𝑓𝑖 (𝜃 1 )) (∇ 𝑓𝑖 (𝜃 1 − 𝛼∇ 𝑓𝑖 (𝜃 1 )) − ∇ 𝑓𝑖 (𝜃 2 − 𝛼∇ 𝑓𝑖 (𝜃 2 )))
+ ((𝐼 − 𝛼∇2 𝑓𝑖 (𝜃 1 )) − (𝐼 − 𝛼∇2 𝑓𝑖 (𝜃 2 )))∇ 𝑓𝑖 (𝜃 2 − 𝛼∇ 𝑓𝑖 (𝜃 2 )) k
(adding and subtracting the term (𝐼 − 𝛼∇2 𝑓𝑖 (𝜃 1 ))∇ 𝑓𝑖 (𝜃 2 − 𝛼∇ 𝑓𝑖 (𝜃 2 ))
≤ k𝐼 − 𝛼∇2 𝑓𝑖 (𝜃 1 ) k k∇ 𝑓𝑖 (𝜃 1 − 𝛼∇ 𝑓𝑖 (𝜃 1 )) − ∇ 𝑓𝑖 (𝜃 2 − 𝛼∇ 𝑓𝑖 (𝜃 2 )) k (from triangle inequality)
| {z }
(𝑎)
+ 𝛼 k∇2 𝑓𝑖 (𝜃 1 ) − ∇2 𝑓𝑖 (𝜃 2 ) k k∇ 𝑓𝑖 (𝜃 2 − 𝛼∇ 𝑓𝑖 (𝜃 2 )) k . (57)
| {z }
(𝑏)

Then, for (𝑎), the following holds true


k𝐼 − 𝛼∇2 𝑓𝑖 (𝜃 1 ) k k∇ 𝑓𝑖 (𝜃 1 − 𝛼∇ 𝑓𝑖 (𝜃 1 )) − ∇ 𝑓𝑖 (𝜃 2 − 𝛼∇ 𝑓𝑖 (𝜃 2 )) k
≤ (1 + 𝛼𝐿)𝐿 k𝜃 1 − 𝛼∇ 𝑓𝑖 (𝜃 1 )) − 𝜃 2 + 𝛼∇ 𝑓𝑖 (𝜃 2 ) k (from Assumption 1)
≤ (1 + 𝛼𝐿)𝐿 (k𝜃 1 − 𝜃 2 k + 𝛼k∇ 𝑓𝑖 (𝜃 1 ) − ∇ 𝑓𝑖 (𝜃 2 ) k) (from triangle inequality)
2
≤ (1 + 𝛼𝐿) 𝐿k𝜃 1 − 𝜃 2 k. (58)
Regarding (𝑏), it can be shown that
k∇2 𝑓𝑖 (𝜃 1 ) − ∇2 𝑓𝑖 (𝜃 2 ) k k∇ 𝑓𝑖 (𝜃 2 − 𝛼∇ 𝑓𝑖 (𝜃 2 )) k ≤ 𝜌𝜁 k𝜃 1 − 𝜃 2 k. (from Assumptions 1 and 2)
Substituting (𝑎) and (𝑏) into (57), we have the result.

A PPENDIX B
P ROOF OF L EMMA 2
˜ 𝑖 (𝜃) as follows
We rewrite the stochastic gradient ∇𝐹
 
˜ 𝑖 (𝜃) = 𝐼 − 𝛼∇2 𝑓𝑖 (𝜃) + 𝛿∗ ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) + 𝛿∗

∇𝐹 1 2 (59)
where 𝛿1∗ and 𝛿2∗ are given by
 
𝛿1∗ = 𝛼 ∇2 𝑓𝑖 (𝜃) − ∇˜ 2 𝑓𝑖 (𝜃, D𝑖00) (60)
𝛿2∗ = ∇˜ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ), D𝑖0 − ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) .

(61)
Note that 𝛿1∗ and 𝛿2∗ are independent. Due to Assumption 3, we have
E[𝛿1∗ ] = 0 (62)
(𝛼𝜎𝐻 ) 2
E[k𝛿1∗ k 2 ] ≤ . (63)
𝐷 𝑖00
Next, we proceed to bound the first and second moments of 𝛿2∗ . Regarding the first moment, we have
 ∗ 
E 𝛿 = E ∇˜ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ), D 0 − ∇ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ) + ∇ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ) − ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃))
   
2 𝑖
= E ∇ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ) − ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) (from the tower rule and independence between D and D 0)
  

≤ E ∇ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ) − ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃))
  

≤ 𝛼𝐿E ∇˜ 𝑓𝑖 (𝜃, D𝑖 ) − ∇ 𝑓𝑖 (𝜃)


 
(from the smoothness of 𝑓𝑖 )
𝛼𝜎𝐺 𝐿
≤ √ . (from Assumption 3)
𝐷𝑖
Regarding the second moment, we have
h  2 i h 2 i
E k𝛿2∗ k 2 ≤ 2E ∇˜ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ), D𝑖0 − ∇ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ) + 2E ∇ 𝑓𝑖 𝜃 − 𝛼 ∇˜ 𝑓𝑖 (𝜃, D𝑖 ) − ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃))
   
16

2
2𝜎𝐺 h i
≤ + 2(𝛼𝐿) 2
E ∇˜ 𝑓𝑖 (𝜃, D𝑖 ) − ∇ 𝑓𝑖 (𝜃) 2 (from the tower rule, smoothness of 𝑓𝑖 along with Assumption 3)
𝐷 𝑖0
(𝛼𝐿) 2
 
2 1
≤ 2𝜎𝐺 + . (64)
𝐷 𝑖0 𝐷𝑖
Based on (59), we have
  
˜ 𝑖 (𝜃) − ∇𝐹𝑖 (𝜃) = 𝐼 − 𝛼∇2 𝑓𝑖 (𝜃) E[𝛿∗ ] + E[𝛿∗ ]∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) + E[𝛿∗ 𝛿∗ ]

E ∇𝐹
2 1 1 2

≤ 𝐼 − 𝛼∇2 𝑓𝑖 (𝜃) E[𝛿2∗ ]



(from (62) and submultiplicative property of matrix norm)
𝛼𝜎𝐺 𝐿 (1 + 𝛼𝐿)
≤ √ (65)
𝐷𝑖
which gives us the first result in Lemma 2.
By the submultiplicative property of matrix norm,
˜ 𝑖 (𝜃) − ∇𝐹𝑖 (𝜃) ≤ 𝐼 − 𝛼∇2 𝑓𝑖 (𝜃) 𝛿∗ + k∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) k 𝛿∗ + 𝛿∗ 𝛿∗ .

∇𝐹 (66)
2 1 1 2

Thus, we have
˜ 𝑖 (𝜃) − ∇𝐹𝑖 (𝜃) 2 ≤ 3 𝐼 − 𝛼∇2 𝑓𝑖 (𝜃) 2 𝛿∗ 2 + 3 k∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) k 2 𝛿∗ 2 + 3 𝛿∗ 2 𝛿∗ 2 .

∇𝐹 (67)
2 1 1 2

Taking expectation on both sizes, we obtain


h i h i h i h i h i
E ∇𝐹 ˜ 𝑖 (𝜃) − ∇𝐹𝑖 (𝜃) 2 ≤ 3 (1 + 𝛼𝐿) 2 E 𝛿∗ 2 + 3𝜁 2 E 𝛿∗ 2 + 3E 𝛿∗ 2 E 𝛿∗ 2
2 1 1 2

(using the fact that k𝐼 − 𝛼∇2 𝑓𝑖 (𝜃) k ≤ 1 + 𝛼𝐿)


6(𝛼𝜎𝐺 𝜎𝐻 ) 2 1 2 (𝛼𝐿) 2 3(𝛼𝜁 𝜎𝐻 ) 2
   
(𝛼𝐿) 2 2 1
≤ + + 6𝜎𝐺 (1 + 𝛼𝐿) + + (68)
𝐷 𝑖00 𝐷 𝑖0 𝐷𝑖 𝐷𝑖0 𝐷𝑖 𝐷 𝑖00
which completes the proof.

A PPENDIX C
P ROOF OF L EMMA 3
Recall the definition of 𝐹𝑖 (𝜃). We have
    
∇𝐹𝑖 (𝜃) − 𝐹 𝑗 (𝜃) = 𝐼 − 𝛼∇2 𝑓𝑖 (𝜃) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) − 𝐼 − 𝛼∇2 𝑓 𝑗 (𝜃) ∇ 𝑓 𝑗 𝜃 − 𝛼∇ 𝑓 𝑗 (𝜃)

   
≤ 𝐼 − 𝛼∇2 𝑓𝑖 (𝜃) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) − 𝐼 − 𝛼∇2 𝑓 𝑗 (𝜃) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃))

    
+ 𝐼 − 𝛼∇2 𝑓 𝑗 (𝜃) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) − 𝐼 − 𝛼∇2 𝑓 𝑗 (𝜃) ∇ 𝑓 𝑗 𝜃 − 𝛼∇ 𝑓 𝑗 (𝜃)

 
≤ 𝛼 ∇2 𝑓𝑖 (𝜃) − ∇2 𝑓 𝑗 (𝜃) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃))

| {z }
𝐴
   
+ 𝐼 − 𝛼∇2 𝑓 𝑗 (𝜃) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) − ∇ 𝑓 𝑗 𝜃 − 𝛼∇ 𝑓 𝑗 (𝜃) . (69)

| {z }
𝐵
Regrading 𝐴, due to the submultiplicative property of matrix norm, the following holds
𝐴 ≤ ∇2 𝑓𝑖 (𝜃) − ∇2 𝑓 𝑗 (𝜃) k∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) k

≤ 𝜁 𝛾𝐻 (70)
where the last inequality follows from Assumption 4.
Regrading 𝐵, similarly, we obtain
𝐵 ≤ 𝐼 − 𝛼∇2 𝑓 𝑗 (𝜃) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) − ∇ 𝑓 𝑗 𝜃 − 𝛼∇ 𝑓 𝑗 (𝜃)


(a) 
≤ (1 + 𝛼𝐿) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) − ∇ 𝑓 𝑗 𝜃 − 𝛼∇ 𝑓 𝑗 (𝜃)
 
≤ (1 + 𝛼𝐿) ∇ 𝑓𝑖 (𝜃 − 𝛼∇ 𝑓𝑖 (𝜃)) − ∇ 𝑓𝑖 𝜃 − 𝛼∇ 𝑓 𝑗 (𝜃)
  
+ ∇ 𝑓𝑖 𝜃 − 𝛼∇ 𝑓 𝑗 (𝜃) − ∇ 𝑓 𝑗 𝜃 − 𝛼∇ 𝑓 𝑗 (𝜃)
(b) 
≤ (1 + 𝛼𝐿) 𝛼𝐿 ∇ 𝑓𝑖 (𝜃) − 𝑓 𝑗 (𝜃) + 𝛾𝐺
17

≤ (1 + 𝛼𝐿) 2 𝛾𝐺 (71)

where (a) is derived via triangle inequality, and (b) follows from Assumptions 1 and 4.
Substituting (70) and (71) in (69), we have
∇𝐹𝑖 (𝜃) − 𝐹 𝑗 (𝜃) ≤ (1 + 𝛼𝐿) 2 𝛾𝐺 + 𝛼𝜁 𝛾 𝐻

(72)

thereby completing the proof.

A PPENDIX D
P ROOF OF L EMMA 4
Since 𝑗 is the straggler among devices, (SP1) can be equivalently transformed to
∑︁ 𝜄𝑖 𝑐𝑗𝐷𝑗
min 𝜂1 𝑐 𝑖 𝐷 𝑖 𝜈𝑖2 + 𝜂2
𝝂
𝑖 ∈N
2 𝜈𝑗
s.t. 0 ≤ 𝜈𝑖 ≤ 𝜈𝑖max , ∀𝑖 ∈ N (73)
𝑐𝑖 𝐷 𝑖 𝜈 𝑗
≤ 𝜈𝑖 , ∀𝑖 ∈ N / 𝑗 .
𝑐𝑗𝐷𝑗

Fixing 𝜈 𝑗 , for each 𝑖 ∈ N / 𝑗 5 , we can show that the optimal CPU frequency 𝜈𝑖∗ of the problem (73) can be obtained via solving
the following decomposed convex optimization problem
𝜄𝑖
min 𝜂1 𝑐 𝑖 𝐷 𝑖 𝜈𝑖2
𝜈𝑖 2
s.t. 0 ≤ 𝜈𝑖 ≤ 𝜈𝑖max (74)
𝑐𝑖 𝐷 𝑖 𝜈 𝑗
≤ 𝜈𝑖 .
𝑐𝑗𝐷𝑗
𝑐 𝑗 𝐷 𝑗 𝜈𝑖max
If 𝜈 𝑗 ≤ 𝑐𝑖 𝐷𝑖 , the optimal solution of problem (74) is given by

𝑐𝑖 𝐷 𝑖 𝜈 𝑗
𝜈𝑖∗ = . (75)
𝑐𝑗𝐷𝑗
Otherwise, the problem is infeasible because the constraints are mutually contradictory. Substituting 𝜈𝑖 = 𝜈𝑖∗ in problem (73),
we have the following problem with respect to 𝜈 𝑗

𝜄𝑖 (𝑐 𝑖 𝐷 𝑖 ) 3 𝜄 𝑗 𝑐 𝑗 𝐷 𝑗 2
 ∑︁ 
1
min 𝑔(𝜈 𝑗 ) = 𝜂1 2
+ 𝜈 𝑗 + 𝜂2 𝑐 𝑗 𝐷 𝑗
𝜈𝑗 2(𝑐 𝐷
𝑗 𝑗 ) 2 | {z } 𝑗 𝜈
𝑖 ∈N/ 𝑗
𝑎2
| {z } (76)
𝑎1
𝑐 𝑗 𝐷 𝑗 𝜈𝑖max
s.t. 0 ≤ 𝜈𝑗 ≤ , ∀𝑖 ∈ N .
𝑐𝑖 𝐷 𝑖
We simplify the expression of 𝑔(𝜈 𝑗 ) as 𝑔(𝜈 𝑗 ) = 𝑎 1 𝜈 2𝑗 + 𝑎 2 /𝜈 𝑗 where positive constants 𝑎 1 and 𝑎 2 are defined in (76). Then,
the derivative of 𝑔(𝜈 𝑗 ) can be written as
𝑎2
𝑔 0 (𝜈 𝑗 ) = 2𝑎 1 𝜈 𝑗 − . (77)
𝜈 2𝑗
√︃
Based on (77), the minimum value of 𝑔(𝜈 𝑗 ) is obtained at its stationary point 𝜈¯ 𝑗 B 𝑔 0−1 (0) = 3 𝑎2
2𝑎1 . Thus, the optimal solution
of problem (76) is

𝑐 𝑗 𝐷 𝑗 𝜈 max
(√︂ )
∗ 3
𝑎2 𝑗
𝜈 𝑗 = min , min . (78)
2𝑎 1 𝑖 ∈N 𝑐 𝑖 𝐷 𝑖

Combining (75) and (78), we complete the proof.

5 We implicitly assume that N/ 𝑗 is not empty.


18

A PPENDIX E
P ROOF OF L EMMA 5

Given 𝒛∗ and 𝛿∗ , 𝑝 ∗𝑖 can be obtained via solving the following problem


∑︁ 𝑧∗𝑖,𝑚 𝑝 𝑖
min  
𝑝𝑖 ℎ𝑖 𝑝𝑖
𝑚∈M log2 1 + 𝐼𝑚 +𝐵𝑁0
s.t. 0 ≤ 𝑝 𝑖 ≤ 𝑝 max
𝑖 (79)
∑︁ 𝑧∗𝑖,𝑚 𝑆
  ≤ 𝛿∗
ℎ𝑖 𝑝𝑖
𝑚∈M 𝐵 log 2 1 + 𝐼𝑚 +𝐵 𝑁0

where we eliminate 𝑢 𝑖 , 𝜂1 , 𝑆, and 𝐵 in the objective. Clearly, if


Í ∗ = 0, then 𝑝 ∗𝑖 = 0. If the constraints in (79) are
𝑚∈M 𝑧 𝑖,𝑚
mutually contradictory, i.e.
𝑆
(𝐼𝑚𝑖∗ + 𝐵𝑁0 ) (2 𝐵 𝛿∗ − 1)
𝑝 max
𝑖 < (80)
ℎ𝑖
then 𝑧 ∗𝑖,𝑚 must be 0, which gives (41).
When there exists 𝑚 𝑖∗ such that 𝑧∗𝑖,𝑚∗ = 1, by denoting 𝑝˜𝑖 B ℎ𝑖 𝑝𝑖
𝐼𝑚∗ +𝐵 𝑁0 and rearranging the terms, we can transform the
𝑖 𝑖
problem (79) to
𝑝˜𝑖
min 𝑔3 ( 𝑝˜𝑖 ) =
𝑝˜ 𝑖 log2 (1 + 𝑝˜𝑖 )
ℎ𝑖 𝑝 max
𝑖 (81)
s.t. 0 ≤ 𝑝˜𝑖 ≤
𝐼𝑚𝑖∗ + 𝐵𝑁0
𝑆
2 𝐵 𝛿∗ − 1 ≤ 𝑝˜𝑖 .

We have
log2 (1 + 𝑝˜𝑖 ) − 𝑝˜𝑖 /((1 + 𝑝˜𝑖 ) ln 2)
𝑔30 ( 𝑝˜𝑖 ) = 2 . (82)
log2 (1 + 𝑝˜𝑖 )

Denoting the numerator of (82) as 𝑓3 ( 𝑝˜𝑖 ), we have 𝑓3 (0) = 0 and


 
0 1 1 1
𝑓3 ( 𝑝˜𝑖 ) = − > 0, when 𝑝˜𝑖 > 0 (83)
ln 2 (1 + 𝑝˜𝑖 ) (1 + 𝑝˜𝑖 ) 2

which implies that 𝑔3 ( 𝑝˜𝑖 ) is monotonically increasing with 𝑝˜𝑖 > 0. Thus, recalling the definition of 𝑝˜𝑖 , due to (41), we obtain
𝑆
(𝐼𝑚𝑖∗ + 𝐵𝑁0 ) (2 𝐵 𝛿∗ − 1)
𝑝 ∗𝑖 = (84)
ℎ𝑖
which completes the proof of (42).

A PPENDIX F
P ROOF OF L EMMA 6

If 𝑗 denotes the straggler among all devices and N ∗ ≠ ∅, we can rewrite (45) for each 𝑖 ∈ N ∗ as
∑︁ 𝑆 ∑︁ 𝑆
𝑧 ∗𝑖,𝑚  ≤ 𝑧 ∗𝑗,𝑚 ℎ 𝑗 𝑝 ∗𝑗
ℎ𝑖 𝑝𝑖∗
  
𝑚∈M 𝐵 log2 1 + 𝐼𝑚 +𝐵 𝑁0 𝑚∈M 𝐵 log2 1 + 𝐼𝑚 +𝐵 𝑁0
ℎ 𝑗 𝑝 ∗𝑗
! !
ℎ𝑖 𝑝 ∗𝑖
⇔ log2 1 + ≤ log2 1 +
𝐼𝑚∗𝑗 + 𝐵𝑁0 𝐼𝑚𝑖∗ + 𝐵𝑁0
(𝐼𝑚𝑖∗ + 𝐵𝑁0 )ℎ 𝑗
⇔ 𝑝 ∗𝑗 ≤ 𝑝 ∗𝑖 . (85)
(𝐼𝑚∗𝑗 + 𝐵𝑁0 )ℎ𝑖
19

Based on (85), by eliminating constants 𝐵 and 𝑆, we transform (SP2) under fixed 𝒛 ∗ to


∑︁ 𝑝𝑖 1
min 𝜂1   + 𝜂2  
{ 𝑝𝑖 }𝑖∈N ∗ ℎ𝑗 𝑝𝑗
𝑖 ∈N ∗ ℎ𝑖 𝑝𝑖
log2 1 + 𝐼 ∗ +𝐵 𝑁0 log2 1 + 𝐼 ∗ +𝐵𝑁0
𝑚 𝑚
𝑖 𝑗

s.t. 0 ≤ 𝑝𝑖 ≤ 𝑝 max
∀𝑖 ∈ N ∗ (86)
𝑖 ,
(𝐼𝑚𝑖∗ + 𝐵𝑁0 )ℎ 𝑗
𝑝 𝑗 ≤ 𝑝 𝑖 , ∀𝑖 ∈ N ∗ / 𝑗 .
(𝐼𝑚∗𝑗 + 𝐵𝑁0 )ℎ𝑖
Due to (85), for each 𝑖 ∈ N ∗ / 𝑗, we can obtain the optimal solution 𝑝 ∗𝑖 of (86) (abusing notations) via solving the decomposed
problem as follows
𝑝˜𝑖
min 𝑔3 ( 𝑝˜𝑖 ) =
𝑝˜ 𝑖 log2 (1 + 𝑝˜𝑖 )
ℎ𝑖 𝑝 max
𝑖
s.t. 0 ≤ 𝑝˜𝑖 ≤ (87)
𝐼𝑚𝑖∗ + 𝐵𝑁0
ℎ𝑗
𝑝 𝑗 ≤ 𝑝˜𝑖
𝐼𝑚∗𝑗 + 𝐵𝑁0
ℎ𝑖 𝑝𝑖
where 𝑝˜𝑖 B 𝐼𝑚∗ +𝐵 𝑁0 . Note that in (87) we consider the nontrivial case where N / 𝑗 is not empty. It is easy to see that
𝑖

log2 (1 + 𝑝˜𝑖 ) − 𝑝˜𝑖 /((1 + 𝑝˜𝑖 ) ln 2)


𝑔30 ( 𝑝˜𝑖 ) = 2 . (88)
log2 (1 + 𝑝˜𝑖 )
Denoting the numerator of (88) as 𝑓3 ( 𝑝˜𝑖 ), we have 𝑓 (0) = 0 and
 
0 1 1 1
𝑓3 ( 𝑝˜𝑖 ) = − > 0, when 𝑝˜𝑖 > 0 (89)
ln 2 (1 + 𝑝˜𝑖 ) (1 + 𝑝˜𝑖 ) 2
ℎ𝑖 𝑝𝑖max (𝐼𝑚∗ +𝐵 𝑁0 )
which implies that 𝑔3 ( 𝑝˜𝑖 ) is monotonically increasing with 𝑝˜𝑖 > 0. Thus, if 𝑝 𝑗 ≤ 𝑗
ℎ 𝑗 (𝐼𝑚∗ +𝐵𝑁0 ) , recalling the definition of
𝑖
𝑝˜𝑖 , we have
ℎ 𝑗 (𝐼𝑚𝑖∗ + 𝐵𝑁0 )
𝑝 ∗𝑖 = 𝑝 𝑗 , ∀𝑖 ∈ N ∗ / 𝑗 . (90)
ℎ𝑖 (𝐼𝑚∗𝑗 + 𝐵𝑁0 )
Otherwise, problem (87) is infeasible.
ℎ𝑗 𝑝𝑗 (𝐼𝑚∗ +𝐵 𝑁0 ) 𝑝˜ 𝑗
Similar to (87), denote 𝑝˜ 𝑗 B 𝐼𝑚∗ +𝐵 𝑁0 and institute 𝑝 𝑖 = 𝑝 ∗𝑖 in (86) (noting that 𝑝 𝑖 = 𝑖
ℎ𝑖 ), whereby we have the
𝑗
following problem regarding 𝑝˜ 𝑗
∑︁ 𝐼𝑚∗ + 𝐵𝑁0 𝑝˜ 𝑗 𝜂2
min 𝑔4 ( 𝑝˜ 𝑗 ) = 𝜂1 𝑖
·  + 
𝑝˜ 𝑗
𝑖 ∈N ∗
ℎ𝑖 log2 1 + 𝑝˜ 𝑗 log2 1 + 𝑝˜ 𝑗
| {z }
𝑏1
(91)
ℎ𝑖 𝑝 max
s.t. 0 ≤ 𝑝˜ 𝑗 ≤ 𝑖
, ∀𝑖 ∈ N ∗ .
𝐼𝑚𝑖∗ + 𝐵𝑁0
𝑏1 𝑝˜ 𝑗 𝜂2
Denoting positive constant 𝑏 1 as in (91), we can write 𝑔4 ( 𝑝˜ 𝑗 ) = log2 (1+ 𝑝˜ 𝑗 ) + log2 (1+ 𝑝˜ 𝑗 ) and obtain
𝜂2
𝑔40 ( 𝑝˜ 𝑗 ) = 𝑏 1 𝑔30 ( 𝑝˜ 𝑗 ) − 2
(1 + 𝑝˜ 𝑗 ) log2 (1 + 𝑝˜ 𝑗 ) ln 2

log2 (1 + 𝑝˜ 𝑗 ) − 𝑝˜ 𝑗 / (1 + 𝑝˜ 𝑗 ) ln 2 𝜂2
= 𝑏1 2 − 2
log2 (1 + 𝑝˜ 𝑗 ) (1 + 𝑝˜ 𝑗 ) log2 (1 + 𝑝˜ 𝑗 ) ln 2

𝑏 1 (1 + 𝑝˜ 𝑗 ) log2 (1 + 𝑝˜ 𝑗 ) ln 2 − 𝑝˜ 𝑗 − 𝜂2
= 2 . (92)
(1 + 𝑝˜ 𝑗 ) log2 (1 + 𝑝˜ 𝑗 ) ln 2
Next, we show that 𝑔4 ( 𝑝˜ 𝑗 ) has unique minimum point. Denoting the numerator of (92) as 𝑓4 ( 𝑝˜ 𝑗 ), we have 𝑓4 (0) = −𝜂2 ≤ 0
and
𝑓40 ( 𝑝˜ 𝑗 ) = 𝑏 1 log2 (1 + 𝑝˜ 𝑗 ) ln 2 ≥ 0, when 𝑝˜ 𝑗 ≥ 0 (93)
20

√︃ 𝜂
(1+ max{ 2 ,1}−1)/ln 2
which implies 𝑓4 ( 𝑝˜ 𝑗 ) is monotonically increasing. Besides, it can be shown that 𝑓4 (𝑏 2 ) ≥ 0, where 𝑏 2 = 2 𝑏1
.
0 0 0 0
Thus, there must exist a unique point 𝑝˜ 𝑗 ∈ (0, 𝑏 2 ] such that 𝑔4 ( 𝑝˜ 𝑗 ) = 𝑓4 ( 𝑝˜ 𝑗 ) = 0 holds, and it is also the minimum value of
𝑔4 ( 𝑝˜ 𝑗 ).
Accordingly, the optimal solution of the problem (91) can be expressed as
( )
∗ 0
ℎ𝑖 𝑝 max
𝑖
𝑝˜ 𝑗 = min 𝑝˜ 𝑗 , min∗ . (94)
𝑖 ∈N 𝐼 𝑚∗ + 𝐵𝑁0
𝑖

Therefore, we complete the proof.

A PPENDIX G
P ROOF OF T HEOREM 1
From Lemma 1 (smoothness of 𝐹𝑖 ), for any 𝜃 1 , 𝜃 2 ∈ R𝑑 , we have
𝐿𝐹
𝐹 (𝜃 2 ) ≤ 𝐹 (𝜃 1 ) + ∇𝐹 (𝜃 1 ) > (𝜃 2 − 𝜃 1 ) + k𝜃 2 − 𝜃 1 k 2 . (95)
2
Combining (95) with (2), we have
𝐿 𝐹 𝑘+1
𝐹 (𝜃 𝑘+1 ) ≤ 𝐹 (𝜃 𝑘 ) + ∇𝐹 (𝜃 𝑘 ) > (𝜃 𝑘+1 − 𝜃 𝑘 ) + k𝜃 − 𝜃 𝑘 k2
2
! 2
1 ∑︁ ˜ 𝐿 𝐹 𝛽2 1 ∑︁ ˜
= 𝐹 (𝜃 𝑘 ) − 𝛽∇𝐹 (𝜃 𝑘 ) > 𝑘
∇𝐹𝑖 (𝜃 ) + 𝑘
∇𝐹𝑖 (𝜃 ) . (96)
𝑛 𝑘 𝑖 ∈N 2 𝑛 𝑘 𝑖 ∈N
𝑘 𝑘

Denoting
! 2
𝑘 𝑘 > 1 ∑︁ ˜ 𝑘 𝐿 𝐹 𝛽2 1 ∑︁
˜ 𝑖 (𝜃 )
𝑘
𝐺 B 𝛽∇𝐹 (𝜃 ) ∇𝐹𝑖 (𝜃 ) − ∇𝐹 (97)


𝑛 𝑘 𝑖 ∈N 2 𝑛𝑘
𝑖 ∈N𝑘

𝑘

we have E[𝐹 (𝜃 𝑘 ) − 𝐹 (𝜃 𝑘+1 )] ≥ E[𝐺 𝑘 ]. Next, we bound E[𝐺 𝑘 ] from below.


First, for any 𝑖 ∈ N , we rewrite ∇𝐹 (𝜃 𝐾 ) as
1 ∑︁ ˜ 𝑖 (𝜃 𝑘 ) +∇𝐹
˜ 𝑖 (𝜃 𝑘 )
∇𝐹 (𝜃 𝑘 ) = ∇𝐹 𝑗 (𝜃 𝑘 ) − ∇𝐹𝑖 (𝜃 𝑘 ) + ∇𝐹𝑖 (𝜃 𝑘 ) − ∇𝐹 (98)
𝑛 𝑗 ∈N | {z }
| {z } 𝐵𝑖
𝐴𝑖

and bound E[k 𝐴𝑖 k2] by


 2 
2 𝑘 1 ∑︁ 𝑘
E[k 𝐴𝑖 k ] = E ∇𝐹𝑖 (𝜃 ) − ∇𝐹 𝑗 (𝜃 )

𝑛 𝑗 ∈N
 ∑︁   2 
1 𝑘 𝑘
=E ∇𝐹 𝑗 (𝜃 ) − ∇𝐹𝑖 (𝜃 )

𝑛 𝑗 ∈N
(a) 1 ∑︁ h 2 i
≤ E ∇𝐹 𝑗 (𝜃 𝑘 ) − ∇𝐹𝑖 (𝜃 𝑘 )
𝑛 𝑗 ∈N
(b)
≤ (1 + 𝛼𝐿) 2 𝛾𝐺 + 𝛼𝜁 𝛾 𝐻 (99)
where (b) is derived from Lemma 3. Inequality (a) follows from the fact that, for any 𝑎 𝑖 ∈ R𝑑 and 𝑏 𝑖 ∈ R,
2
𝑑
!2
∑︁ ∑︁ ∑︁
( 𝑗)
𝑏𝑖 𝑎𝑖 =

𝑏𝑖 𝑎𝑖

𝑖 ∈N𝑘 𝑗=1 𝑖 ∈N𝑘
𝑑
! !
∑︁ ∑︁   2 ∑︁
( 𝑗) 2
≤ 𝑎𝑖 𝑏𝑖
𝑗=1 𝑖 ∈N𝑘 𝑖 ∈N𝑘
𝑑
!
∑︁ ∑︁  2 ∑︁
( 𝑗) ª
=­ 𝑏 2𝑖
©
𝑎𝑖 ®
« 𝑗=1 𝑖 ∈N𝑘 ! ¬ 𝑖 ∈N !
𝑘

∑︁ ∑︁
= k𝑎 𝑖 k 2 𝑏 2𝑖 (100)
𝑖 ∈N𝑘 𝑖 ∈N𝑘
21

( 𝑗)
where 𝑎 𝑖 denotes the 𝑗-th coordinate of 𝑎 𝑖 , and the first inequality is derived by using Cauchy-Schwarz inequality.
Regarding E[k𝐵𝑖 k 2 ], from Lemma 2,

E[k𝐵𝑖 k 2 ] ≤ 𝜎𝐹2𝑖 . (101)

Substituting (98) in E[𝐺 𝑘 ], we obtain


! 2 
2 1 ∑︁

𝑘
 𝑘 > 1 ∑︁
˜ 𝑖 (𝜃 ) −
𝑘 𝐿 𝐹 𝛽 
˜ 𝑖 (𝜃 ) 
𝑘
E[𝐺 ] = E  𝛽∇𝐹 (𝜃 ) ∇𝐹 ∇𝐹


 𝑛 𝑘 𝑖 ∈N 2 𝑛 𝑘 𝑖 ∈N 
 𝑘 𝑘 
2 
2

𝛽 ∑︁   > 𝐿𝐹 𝛽 1 ∑︁ 
=E  ˜ 𝑘 ˜
𝐴𝑖 + 𝐵𝑖 + ∇𝐹𝑖 (𝜃 ) ∇𝐹𝑖 (𝜃 ) − 𝑘
˜ 𝑘 
∇𝐹𝑖 (𝜃 ) 
 𝑛 𝑘 𝑖 ∈N𝑘 2 𝑛 𝑘 𝑖 ∈N 
 𝑘 
" # 2 
2

𝛽 ∑︁  2
 𝐿 𝛽 1 ∑︁ 
𝐴𝑖> ∇𝐹
˜ 𝑖 (𝜃 𝑘 ) + 𝐵𝑖> ∇𝐹

=E ˜ 𝑖 (𝜃 𝑘 )
˜ 𝑖 (𝜃 𝑘 ) + ∇𝐹 −
𝐹
E  ˜ 𝑖 (𝜃 𝑘 )  .
∇𝐹 (102)
𝑛 𝑘 𝑖 ∈N 2  𝑛 𝑘 𝑖 ∈N𝑘 
𝑘  
Note that, for any random variables 𝑎, 𝑏 ∈ R𝑑 and for 𝑐 ≠ 0,
   2
!
𝑏 E k𝑏k
E 𝑎 > 𝑏 ≥ −E 2 (𝑐𝑎) > ≥ − 𝑐2 E k𝑎k 2 +
   
. (103)
2𝑐 4𝑐2

With 𝑔(𝑥) B 𝑥E[k𝑎k 2 ] + E[k𝑏k 2 ]/(4𝑥), we have


v
t
E k𝑏k 2
 

𝑥 = arg min 𝑔(𝑥) = (104)
4E k𝑎k 2
 
𝑥

which implies that if we set 𝑐2 = 𝑥 ∗ , the lower bound in (103) becomes tight. Thus, substituting this in (103) and rearranging
the terms, we have
√︃ 
E 𝑎 𝑏 ≥ − E k𝑎k 2 E k𝑏k 2 .
 >    
(105)

Based on (105) along with (100), due to the tower rule, we can bound E[𝐺 𝑘 ] as follows
" #
𝐿 𝐹 𝛽2 1 ∑︁ ˜
 2 
𝑘 𝛽 ∑︁  > ˜ 𝑘 >˜ 𝑘

˜ 𝑖 (𝜃 ) 𝑘
2 
E[𝐺 ] = E 𝐴𝑖 ∇𝐹𝑖 (𝜃 ) + 𝐵𝑖 ∇𝐹𝑖 (𝜃 ) + ∇𝐹 − ∇𝐹𝑖 (𝜃 𝑘 ) (from (102))

E
𝑛 𝑘 𝑖 ∈N 2 𝑛 𝑘 𝑖 ∈N
𝑘 𝑘
" #
𝛽 ∑︁  h > ˜ i h i 
𝑘 >˜ 𝑘 ˜ 𝑘 2
=E E 𝐴𝑖 ∇𝐹𝑖 (𝜃 ) N𝑘 + E 𝐵𝑖 ∇𝐹𝑖 (𝜃 ) N𝑘 + ∇𝐹𝑖 (𝜃 )
𝑛 𝑘 𝑖 ∈N
𝑘

𝐿 𝐹 𝛽2 1 ∑︁ ˜
 2 
𝑘
− E ∇𝐹𝑖 (𝜃 ) (using the tower rule)
2 𝑛 𝑘 𝑖 ∈N
𝑘
" √︂ h !#
𝛽 ∑︁ i √︂ h 2 i
√︂ h i √︂ h 2 i
2 ˜ 𝑖 (𝜃 𝑘 ) N𝑘 − E k𝐵𝑖 k N𝑘 2 ˜ 𝑖 (𝜃 𝑘 ) N𝑘
≥E − E k 𝐴𝑖 k N𝑘 E ∇𝐹 E ∇𝐹

𝑛 𝑘 𝑖 ∈N
𝑘
   2 
𝐿𝐹 𝛽 1 ∑︁ ˜
+𝛽 1− ∇𝐹𝑖 (𝜃 𝑘 ) (using (105) and (100), and rearranging terms)

E
2 𝑛 𝑘 𝑖 ∈N
𝑘
" √︃  √︂ h #
𝛽 ∑︁ 2
2 i
≥E − (1 + 𝛼𝐿) 𝛾𝐺 + 𝛼𝜁 𝛾 𝐻 + 𝜎𝐹𝑖 ˜
E ∇𝐹𝑖 (𝜃 ) N𝑘
𝑘 (from (99) and (101))
𝑛 𝑘 𝑖 ∈N
𝑘
   2 
𝐿𝐹 𝛽 1 ∑︁ ˜ 𝑘
+𝛽 1− E ∇𝐹𝑖 (𝜃 )
2 𝑛 𝑘 𝑖 ∈N
𝑘
"   √︃  √︂ h #
1 ∑︁ 𝐿 𝐹 𝛽 ˜ 𝑘 2
2
2 i
˜ 𝑖 (𝜃 𝑘 ) N𝑘
= 𝛽E 1− ∇𝐹𝑖 (𝜃 ) − (1 + 𝛼𝐿) 𝛾𝐺 + 𝛼𝜁 𝛾 𝐻 + 𝜎𝐹𝑖 E ∇𝐹 (106)
𝑛 𝑘 𝑖 ∈N 2
𝑘

thereby completing the proof.


22

A PPENDIX H
P ROOF OF T HEOREM 5
Í
First, it is easy to see that 𝑔2 (𝒛, 𝒑) is upper bounded by 𝑖 ∈N 𝑢 𝑖 . Besides, due to (43), we have
( 𝑆
! )
𝑡+1
∑︁ ∑︁ 𝜂1 𝛿 𝑡 (𝐼𝑚 + 𝐵𝑁0 ) (2 𝐵 𝛿 𝑡 − 1)
𝒛 = arg max 𝑧𝑖,𝑚 𝑢 𝑖 − , s.t. (26) − (35) . (107)
𝒛 𝑖 ∈N
ℎ𝑖
𝑚∈M

Based on (48), the optimal transmission power 𝒑ˆ 𝑡+1 corresponding to 𝒛 𝑡+1 is given by
𝑆
 (𝐼𝑚𝑖∗ +𝐵𝑁0 ) (2 𝐵 𝛿 𝑡 −1)

 Í 𝑡+1
, if 𝑚∈M 𝑧 𝑖,𝑚 =1

𝑝ˆ𝑖𝑡+1 = ℎ𝑖 (108)
0

, otherwise

where we slightly abuse the notation and denote the RB block allocated to 𝑖 as 𝑚 𝑖∗ , i.e., 𝑧𝑖,𝑚 𝑡+1
∗ = 1. Thus, we have 𝑔(𝒛 , 𝒑 ) ≤
𝑡 𝑡
𝑖
𝑔(𝒛 𝑡+1 , 𝒑ˆ 𝑡+1 ). From (49), 𝑔(𝒛 𝑡+1 , 𝒑ˆ 𝑡+1 ) ≤ 𝑔(𝒛 𝑡+1 , 𝒑 𝑡+1 ). Using these two inequalities, we obtain
𝑔(𝒛 𝑡 , 𝒑 𝑡 ) ≤ 𝑔(𝒛 𝑡+1 , 𝒑 𝑡+1 ) (109)
thereby completing the proof.

A PPENDIX I
P ROOF OF C OROLLARIES 1 AND 2
A. Proof of Corollary 1
1
𝜃 𝑖𝑘,𝑡 as a “virtual” global model in local step 𝑡 (𝜃 𝑖𝑘,𝑡 is defined in (2)). Similar to
Í
Denote auxiliary variable 𝜃 𝑘,𝑡 B 𝑛𝑘 𝑖 ∈N𝑘
(96), the following holds
𝐿 𝐹 𝑘,𝑡+1
𝐹 (𝜃 𝑘,𝑡+1 ) ≤ 𝐹 (𝜃 𝑘,𝑡 ) + ∇𝐹 (𝜃 𝑘,𝑡 ) > (𝜃 𝑘,𝑡+1 − 𝜃 𝑘,𝑡 ) + k𝜃 − 𝜃 𝑘,𝑡 k 2 (from Lemma 1)
2 !
1 ∑︁ 𝑘,𝑡+1 𝐿 𝐹 𝑘,𝑡+1
= 𝐹 (𝜃 𝑘,𝑡 ) + ∇𝐹 (𝜃 𝑘,𝑡 ) > 𝜃𝑖 − 𝜃 𝑖𝑘,𝑡 + k𝜃 − 𝜃 𝑘,𝑡 k 2 (from the definition of 𝜃 𝑘,𝑡 )
𝑛 𝑘 𝑖 ∈N 2
𝑘
! 2
𝑘,𝑡 𝑘,𝑡 > 1
∑︁
˜ 𝑘,𝑡 𝐿 𝐹 𝛽2 1 ∑︁ ˜
𝑘,𝑡
= 𝐹 (𝜃 ) − 𝛽∇𝐹 (𝜃 ) ∇𝐹𝑖 (𝜃 ) + ∇𝐹𝑖 (𝜃 ) . (from (2))
𝑛 𝑘 𝑖 ∈N 2 𝑛 𝑘 𝑖 ∈N
𝑘 𝑘

Denote
2
𝛽 ∑︁ 2 1 ∑︁
𝐺 𝑘,𝑡
B ˜ 𝑖 (𝜃 𝑘,𝑡 ) − 𝐿 𝐹 𝛽
∇𝐹 (𝜃 𝑘,𝑡 ) > ∇𝐹

˜ 𝑖 (𝜃 ) .
∇𝐹 𝑘,𝑡
(110)
𝑛 𝑘 𝑖 ∈N 2 𝑛𝑘
𝑖 ∈N𝑘

𝑘

We have E[𝐹 (𝜃 𝑘,𝑡 ) − 𝐹 (𝜃 𝑘,𝑡+1 )] ≥ E[𝐺 𝑘,𝑡 ]. Rewrite ∇𝐹 (𝜃 𝑘,𝑡 ) for each 𝑖 ∈ N𝑘 as follows
˜ 𝑖 (𝜃 𝑘,𝑡 ) +∇𝐹
∇𝐹 (𝜃 𝑘,𝑡 ) = ∇𝐹 (𝜃 𝑘,𝑡 ) − ∇𝐹 (𝜃 𝑖𝑘,𝑡 ) + ∇𝐹 (𝜃 𝑖𝑘,𝑡 ) − ∇𝐹𝑖 (𝜃 𝑖𝑘,𝑡 ) + ∇𝐹𝑖 (𝜃 𝑖𝑘,𝑡 ) − ∇𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 ). (111)
𝑖 𝑖
| {z } | {z } | {z }
𝐶𝑖 𝐴𝑖 𝐵𝑖

Note that in (111), 𝐴𝑖 and 𝐵𝑖 are similar to those in (98) with 𝜃 𝑘 being replaced by 𝜃 𝑖𝑘,𝑡 . Thus, from (99) and (101), E[k 𝐴𝑖 k 2 ]
and E[k𝐵𝑖 k 2 ] are bounded by
E[k 𝐴𝑖 k 2 ] ≤ (1 + 𝛼𝐿) 2 𝛾𝐺 + 𝛼𝜁 𝛾 𝐻 (112)
2
E[k𝐵𝑖 k ] ≤ 𝜎𝐹2𝑖 . (113)
Regarding E[k𝐶𝑖 k 2 ], we can write
 2 
2 𝑘,𝑡 𝑘,𝑡
E[k𝐶𝑖 k ] = E ∇𝐹 (𝜃 ) − ∇𝐹 (𝜃 𝑖 )

 2 
≤ 𝐿 2𝐹 E 𝜃 𝑘,𝑡 − 𝜃 𝑖𝑘,𝑡 . (From Lemma 1)

To bound E[k𝐶𝑖 k 2 ], denote


  2  
𝑘,𝑡 𝑘,𝑡
𝑎 𝑡 B max E 𝜃 − 𝜃 𝑖 (114)
𝑖 ∈N𝑘
23

with 𝑎 0 = 0. Then we have


  2  
𝑎 𝑡+1 = max E 𝜃 𝑘,𝑡+1 − 𝜃 𝑖𝑘,𝑡+1

𝑖 ∈N𝑘
  2  
 
˜ 𝑖 (𝜃 𝑘,𝑡 ) − 1

  𝑘,𝑡
 ∑︁    

= max E  𝜃 𝑖 − 𝛽 ∇𝐹 𝜃 𝑘,𝑡
− 𝛽 ˜ 𝑗 (𝜃 𝑘,𝑡 ) 
∇𝐹
𝑖 ∈N𝑘  𝑖 𝑛 𝑘 𝑗 ∈N 𝑗 𝑗

 

𝑘
 


  
  2  2  
    
1 ∑︁ 𝑘,𝑡  1 1


  ∑︁  

2

 𝑘,𝑡
≤ max (1 + 𝜖)E  𝜃 𝑖 − 𝜃𝑗  + 𝛽 1 + ˜ 𝑘,𝑡
E  ∇𝐹𝑖 (𝜃 𝑖 ) −
 ˜ 𝑘,𝑡
∇𝐹 𝑗 (𝜃 𝑗 ) 
𝑖 ∈N𝑘 
  𝑛 𝑘 𝑗 ∈N  𝜖  𝑛 𝑘 𝑗 ∈N  

  𝑘  𝑘 
    
(from [9, Equation (68)])
  2     2  
     
1 1 1

  𝑘,𝑡
 ∑︁ 

 

  ∑︁  

2
 ˜
≤ (1 + 𝜖) max E  𝜃 𝑖 −
𝑘,𝑡 
𝜃𝑗  + 𝛽 1 + max E  ∇𝐹𝑖 (𝜃 𝑖 ) − 𝑘,𝑡 ˜ 𝑘,𝑡 
∇𝐹 𝑗 (𝜃 𝑗 ) 
𝑖 ∈N𝑘  
 𝑛 𝑘 𝑗 ∈N 
  𝜖 𝑖 ∈N𝑘    𝑛 𝑘 𝑗 ∈N  

  𝑘    𝑘 
  (    ) 
2
   
1 ˜ 𝑖 (𝜃 𝑘,𝑡 ) − 1
∑︁
= (1 + 𝜖)𝑎 𝑡 + 𝛽2 1 +

max E ∇𝐹 ˜ 𝑗 (𝜃 𝑘,𝑡 )
∇𝐹 (115)
𝜖 𝑖 ∈N𝑘 𝑖 𝑛𝑘 𝑗
𝑗 ∈N𝑘
| {z }
𝐻𝑖

for any 𝜖 > 0. For 𝐻𝑖 , we first write


 2   2 
 
 1 ∑︁
 +2 E  1
 ∑︁ 
˜
 
˜

𝐻𝑖 ≤ 2 E  ∇𝐹𝑖 (𝜃 𝑖𝑘,𝑡 ) − ∇𝐹 𝑗 (𝜃 𝑘,𝑡 ) ∇𝐹 (𝜃 𝑘,𝑡
) − ∇𝐹 (𝜃 𝑘,𝑡
) + ∇𝐹 (𝜃 𝑘,𝑡
) − ∇𝐹 (𝜃 𝑘,𝑡 
)  . (116)

𝑗 𝑗 𝑗 𝑗 𝑗 𝑖 𝑖 𝑖 𝑖
 𝑛 𝑘 𝑗 ∈N
  𝑛 𝑘

𝑗 ∈N𝑘 
 𝑘 
   
| {z } | {z }
𝐻𝑖,1 𝐻𝑖,2

Next, we bound 𝐻𝑖,1 and 𝐻𝑖,2 separately.


• Upper Bound of 𝐻𝑖,1 : Based on Lemma 3, we have

 2 
 1 ∑︁  
𝑘,𝑡 𝑘,𝑡

𝑘,𝑡 𝑘,𝑡 1 ∑︁ 
𝑘,𝑡

𝑘,𝑡 

𝐻𝑖,1 = E  ∇𝐹𝑖 (𝜃 ) − ∇𝐹 𝑗 (𝜃 ) + ∇𝐹𝑖 (𝜃 𝑖 ) − ∇𝐹𝑖 (𝜃 ) + ∇𝐹 𝑗 (𝜃 ) − ∇𝐹 𝑗 (𝜃 𝑗 ) 

 𝑛 𝑘 𝑛 𝑘 𝑗 ∈N 
 𝑗 ∈N𝑘 𝑘
 
2 ∑︁ h 𝑘,𝑡 𝑘,𝑡 2
i
≤ E ∇𝐹𝑖 (𝜃 ) − ∇𝐹 𝑗 (𝜃 ) (from (100))
𝑛 𝑘 𝑗 ∈N
𝑘
 2 

 1 ∑︁   
+ 2E  ∇𝐹𝑖 (𝜃 𝑖𝑘,𝑡 ) − ∇𝐹𝑖 (𝜃 𝑘,𝑡 ) + ∇𝐹 𝑗 (𝜃 𝑘,𝑡 ) − ∇𝐹 𝑗 (𝜃 𝑘,𝑡 )
 
 𝑛 𝑘 𝑗 ∈N 𝑗 
 𝑘

 

 2 

2

𝑘,𝑡 𝑘,𝑡 1 ∑︁ 
𝑘,𝑡

𝑘,𝑡 

≤ 2𝛾𝐺 + 2E  ∇𝐹𝑖 (𝜃 𝑖 ) − ∇𝐹𝑖 (𝜃 ) + ∇𝐹 𝑗 (𝜃 ) − ∇𝐹 𝑗 (𝜃 𝑗 )  (from Lemma 3)

 𝑛 𝑘 𝑗 ∈N 
 𝑘
 
 
2 1 ∑︁ 2
2
 𝑘,𝑡 𝑘,𝑡 𝑘,𝑡 𝑘,𝑡 
≤ 2𝛾𝐺 + 4E  ∇𝐹𝑖 (𝜃 𝑖 ) − ∇𝐹𝑖 (𝜃 ) +
 ∇𝐹 𝑗 (𝜃 ) − ∇𝐹 𝑗 (𝜃 𝑗 ) 
 𝑛 𝑘 𝑗 ∈N 
 𝑘 
(using Cauchy-Schwarz inequality similar to (118))
 2 1 ∑︁ 2 
2 2
 𝑘,𝑡 𝑘,𝑡 𝑘,𝑡 𝑘,𝑡 
≤ 2𝛾𝐺 + 4𝐿 𝐹 E  𝜃 𝑖 − 𝜃 +
 𝜃 − 𝜃 𝑗 
 𝑛 𝑘 𝑗 ∈N 
 𝑘 
2
≤ 2𝛾𝐺 + 8𝐿 2𝐹 𝑎 𝑡 . (117)
• Upper Bound of 𝐻𝑖,2 : Due to Cauchy-Schwarz inequality, the following holds
2
𝑘 +1 𝑛𝑘 +1 𝑛𝑘 +1
𝑛∑︁
© ∑︁ 2 ª © ∑︁ 2 ª

𝑥 𝑦
𝑗 𝑗 ≤ ­ 𝑥 𝑗 ® ­ 𝑦𝑗 ® (118)
𝑗=1 𝑗=1 𝑗=1
« ¬« ¬
24

where 𝑥 𝑗 = √1𝑛𝑘 and 𝑦 𝑗 = √1 (∇𝐹 𝑗 (𝜃 𝑘,𝑡 ˜ 𝑘,𝑡 ˜ 𝑘,𝑡 𝑘,𝑡


𝑛𝑘 𝑗 ) − ∇𝐹 𝑗 (𝜃 𝑗 )) for 1 ≤ 𝑗 ≤ 𝑛 𝑘 ; 𝑥 𝑛𝑘 +1 = 1 and 𝑦 𝑛𝑘 +1 = ∇𝐹𝑖 (𝜃 𝑖 ) − ∇𝐹𝑖 (𝜃 𝑖 ).
Thus, we have
 2 2 
˜ 𝑖 (𝜃 𝑘,𝑡 ) − ∇𝐹𝑖 (𝜃 𝑘,𝑡 ) + 1
 ∑︁
𝐻𝑖,2 ≤ 2E  ∇𝐹 (𝜃
∇𝐹 𝑗 𝑗
𝑘,𝑡
) − ˜
∇𝐹 𝑗 (𝜃 𝑘,𝑡 
) 
𝑖 𝑖 𝑛 𝑘 𝑗 ∈N 𝑗
 𝑘

 
2
≤ 4𝜎𝐹 (using Lemma 2)

where 𝜎𝐹 = max𝑖 ∈N {𝜎𝐹𝑖 }.


Based on the above results, substituting (116) into (115), we obtain
 
2 1 
𝑎 𝑡+1 = (1 + 𝜖)𝑎 𝑡 + 2𝛽 1 + max 𝐻𝑖,1 + 𝐻𝑖,2
𝜖 𝑖 ∈N𝑘
    
1 1  2 
≤ 1 + 𝜖 + 16𝛽2 𝐿 2𝐹 1 + 𝑎 𝑡 + 4𝛽2 1 + 𝛾𝐺 + 2𝜎𝐹2 . (119)
𝜖 𝜖
Note that (119) is essentially the same as [9, Equation (78)]. Therefore, due to 𝛽 ∈ [0, 1/(10𝐿 𝐹 𝜏)) and [9, Corollay F.2.], we
have

E[k𝐶𝑖 k 2 ] ≤ 𝑎 𝑡 ≤ 35𝛽2 𝑡𝜏(𝛾𝐺


2
+ 2𝜎𝐹2 ). (120)

Similar to (106), substituting (111) in E[𝐺 𝑘,𝑡 ], we obtain


2 
𝐿 𝐹 𝛽2 1 ∑︁ ˜
 
𝑘,𝑡
 𝛽 ∑︁ 𝑘,𝑡 > ˜ 𝑘,𝑡 𝑘,𝑡 
E[𝐺 ] = E   ∇𝐹 (𝜃 ) ∇𝐹𝑖 (𝜃 ) − ∇𝐹𝑖 (𝜃 ) 
 𝑛 𝑘 𝑖 ∈N𝑘 2 𝑛 𝑘 𝑖 ∈N 
 𝑘 
2 
2

𝛽 ∑︁   > 𝐿 𝛽 1 ∑︁ 
= E  𝐴𝑖 + 𝐵𝑖 + 𝐶𝑖 + ∇𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 ) ∇𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 ) − 𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 ) 
∇𝐹
 𝑛 𝑘 𝑖 ∈N𝑘
𝑖 2 𝑛 𝑘 𝑖 ∈N 
 𝑘 
2 

 𝛽 ∑︁
 2  𝐿 𝛽2 1 ∑︁ 
𝐴𝑖> ∇𝐹
˜ 𝑖 (𝜃 𝑘,𝑡 ) + 𝐵𝑖> ∇𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 ) + 𝐶𝑖> ∇𝐹˜ 𝑖 (𝜃 𝑘,𝑡 ) + ∇𝐹 ˜ 𝑖 (𝜃 ) − ˜ 𝑖 (𝜃 𝑘,𝑡 ) 
𝑘,𝑡 𝐹
= E  ∇𝐹


 𝑛 𝑘 𝑖 ∈N𝑘
𝑖 2 𝑛 𝑘 𝑖 ∈N 
" 𝑘 
 h 
𝛽 ∑︁ i h i h i 2
=E E 𝐴𝑖> ∇𝐹˜ 𝑖 (𝜃 𝑘,𝑡 ) N𝑘 + E 𝐵𝑖> ∇𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 ) N𝑘 + E 𝐶𝑖> ∇𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 ) N𝑘 + ∇𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 )
𝑖
𝑛 𝑘 𝑖 ∈N
𝑘
2 
𝐿 𝐹 𝛽2 1 ∑︁ ˜ 
− ∇𝐹𝑖 (𝜃 𝑘,𝑡 )  (using the tower rule)


2 𝑛 𝑘 𝑖 ∈N 
𝑘 
" √︂ h
𝛽 ∑︁ i √︂ h 2 i
√︂ h i √︂ h i
≥E − E k 𝐴𝑖 k N𝑘 2 ˜
E ∇𝐹𝑖 (𝜃 ) N𝑘 − E k𝐵𝑖 k N𝑘
𝑘,𝑡 2
E ∇𝐹 ˜ 𝑖 (𝜃 𝑘,𝑡 ) 2 N𝑘
𝑛 𝑘 𝑖 ∈N
𝑘
√︂ h !#
i √︂ h 2 i

𝐿𝐹 𝛽
 
1 ∑︁ ˜ 2 
2 ˜ 𝑖 (𝜃 𝑘,𝑡 ) N𝑘 𝑘,𝑡
− E k𝐶𝑖 k N𝑘 E ∇𝐹 +𝛽 1− E ∇𝐹
𝑖 (𝜃 )
2 𝑛 𝑘 𝑖 ∈N
𝑘
(using (105) and (100), and rearranging terms)
"  
1 ∑︁ 𝐿 𝐹 𝛽 ˜ 2
≥ 𝛽E 1− ∇𝐹𝑖 (𝜃 𝑘,𝑡 )

𝑛 𝑘 𝑖 ∈N 2
𝑘
√︃  √︂ h !#
√︃ i
2
(1 + 𝛼𝐿) 2 𝛾𝐺 + 𝛼𝜁 𝛾 𝐻 + 𝜎𝐹𝑖 + 𝛽 35𝑡𝜏(𝛾𝐺

− 2 + 2𝜎 2 ) E ∇𝐹˜ 𝑖 (𝜃 𝑘,𝑡 ) N𝑘

.
𝐹

(from (112), (113), and (120))


Therefore, the following holds
" 𝜏−1 #
  ∑︁
𝑘 𝑘+1 𝑘,𝑡 𝑘,𝑡+1
E 𝐹 (𝜃 ) − 𝐹 (𝜃 ) =E 𝐹 (𝜃 ) − 𝐹 (𝜃 )
𝑡=0
𝜏−1
∑︁  
≥ E 𝐺 𝑘,𝑡
𝑡=0
25

" 𝜏−1   √︂ h !#
𝛽 1 ∑︁ ∑︁ ˜ 2 𝜆2 2 i
≥ E 𝑘,𝑡
∇𝐹𝑖 (𝜃 ) − 2 𝜆1 + √ E ∇𝐹˜ 𝑖 (𝜃 ) N𝑘
𝑘,𝑡 (substituting (15))
2 𝑛 𝑘 𝑖 ∈N 𝑡=0 𝐷𝑖
𝑘

where
√︃ √︃
𝜆1 ≥ (1 + 𝛼𝐿) 2 𝛾𝐺 + 𝛼𝜁 𝛾 𝐻 + 𝛽𝜏 35(𝛾𝐺
2 + 2𝜎 2 )
𝐹 (121)
√︃
2 1 + (𝛼𝐿) 2 (𝛼𝜎 ) 2 + (1 + 𝛼𝐿) 2 + 3(𝛼𝜁 𝜎 ) 2 .
 
𝜆2 ≥ 6𝜎𝐺 𝐻 𝐻 (122)
We complete the proof.

B. Proof of Corollary 2
The result of Corollary 2 can be obtained via combining [9, Lemma H.1] and [9, Lemma H.2] with the proof of Lemma 1.

You might also like