How Robust Is Federated Learning To Communication Error? A Comparison Study Between Uplink and Downlink Channels
Abstract—Because of its privacy-preserving capability, federated learning (FL) has attracted significant attention from both academia and industry. However, when implemented over wireless networks, it is not clear how much communication error can be tolerated by FL. This paper investigates the robustness of FL to uplink and downlink communication error. Our theoretical analysis reveals that the robustness depends on two critical parameters, namely the number of clients and the numerical range of the model parameters. It is also shown that the uplink communication in FL can tolerate a higher bit error rate (BER) than the downlink communication, and this difference is quantified by a proposed formula. The findings and theoretical analyses are further validated by extensive experiments.

Index Terms—Federated learning (FL), bit error rate (BER), uplink and downlink

I. INTRODUCTION

Federated learning (FL) is a distributed machine learning scheme that can train a global model based on local data at different clients, without violating their data-privacy requirements [1]. However, because frequent exchanges of deep neural network (DNN) models between the clients and the server over noisy wireless channels are needed, the communication bottleneck becomes one of the biggest challenges for FL [2].

Existing works address the communication challenge in FL through techniques such as model quantization [3]–[5], sparsification [6]–[8], and low-rank approximation [9], [10], under the assumption of error-free communication channels. However, communication error is inevitable and will significantly affect both the learning and communication performance of FL. While some previous research has addressed the impact of communication error, it has been limited to analog communication [11]–[13]. Some recent studies, such as [5], [14], and [15], have investigated FL over digital communication, considering digital modulation or quantizing model parameters into bit streams. However, the impact of communication error on FL over digital communication networks has not been fully understood.

To overcome communication error and achieve a low bit error rate (BER), high transmitting power and substantial channel-decoding effort are usually required. However, due to the inherent error tolerance of DNNs [16], FL may absorb a certain BER without degradation of learning accuracy, which indicates an opportunity to reduce the energy consumption of wireless FL. To exploit this opportunity, we need to understand the tolerance of FL to BER, which unfortunately remains unknown. To fill this research gap, in this paper we investigate the robustness of FL to the BER that exists in both the uplink and downlink communication.

The contributions of this paper are as follows:
1) We investigate the robustness of FL to BER in the uplink and downlink. Through theoretical analyses, we identify the critical parameters that determine the error-tolerating capacity.
2) Based on the theoretical results, we find that the FL uplink is more tolerant to BER than the downlink because of model aggregation and the narrower range of gradients, and we derive a formula to quantify this difference in error tolerance.
3) The critical parameters, the superiority of the uplink in BER tolerance, and the formula quantifying the difference are all validated by experiments. The validated theoretical results will be useful in the design of wireless FL.

The remainder of this paper is organized as follows. In Section II, we describe the system model that captures the BER in the uplink and downlink of FL. In Section III, we present theoretical analyses to reveal what contributes to the BER tolerance in the downlink and uplink. In Section IV, we provide experimental results to validate our theory. Finally, we conclude the paper in Section V.

II. SYSTEM MODEL

In this section, we first briefly introduce FL, and then model how communication error affects it.

A. Federated Learning

We consider a horizontal FL system that comprises one server and n clients (e.g., mobile devices), which collaborate to train a DNN model through multiple communication rounds. In each round, the global model is broadcast to the clients, updated on the local data, and then transmitted back to the server, which produces a new global model by aggregation.

Specifically, federated learning solves the following optimization problem [17]:

$\min_{w} f(w) = \min_{w} \sum_{i=1}^{n} p_i f_i(w)$,   (1)

where $w \in \mathbb{R}^d$ represents the DNN model, with $d$ denoting its dimension; $p_i$ is the ratio of the data size at the ith device to that of all devices; and $f_i(w)$ denotes the local loss function of the ith device.

Fig. 1: FL over noisy wireless channels.

Fig. 2: The uplink error can be averaged in the server.
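The round structure above — weighted local objectives in Eq. (1), updates combined with weights $p_i$ — can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the function name, client weights, and toy sizes are ours.

```python
import numpy as np

def aggregate(global_w, local_updates, data_sizes):
    """Weighted FedAvg-style aggregation: p_i = n_i / sum_j n_j."""
    p = np.asarray(data_sizes, dtype=float)
    p /= p.sum()  # normalize data sizes into the weights p_i of Eq. (1)
    # New global model = old model + weighted sum of local updates.
    return global_w + sum(pi * du for pi, du in zip(p, local_updates))

# Toy example: 3 clients, a 4-parameter "model".
w = np.zeros(4)
updates = [np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)]
new_w = aggregate(w, updates, data_sizes=[10, 10, 20])
# p = [0.25, 0.25, 0.5], so every entry of new_w is 0.25*1 + 0.25*2 + 0.5*3 = 2.25
```

The same `aggregate` step is where, in the noisy setting modeled next, the server receives corrupted copies of the local updates.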
B. Federated Learning with Communication Error

As depicted in Fig. 1, both the uplink and downlink communication channels between the clients and the server are noisy, which introduces communication error into the model parameters. Denoting the global model at the server in the mth communication round as $w_m$, the model received at a mobile device for the first step of local training is a noisy version of the global model, given as

$w_{m,0}^i = w_m'$,   (2)

where $w_m'$ denotes the global model with BER impact.

Then the ith client conducts $\tau$ steps of local stochastic gradient descent (SGD) with a learning rate $\eta$. The local updating rule is given by

$w_{m,t+1}^i = w_{m,t}^i - \eta \nabla f_i(w_{m,t}^i)$,   (3)

where $t = 0, \ldots, \tau - 1$, and $\nabla f_i(w_{m,t}^i)$ represents the stochastic gradient computed from the local training datasets. Upon completing $\tau$ steps of local training, the ith client obtains a new model $w_{m,\tau}^i$. The local model update is obtained as

$\Delta w_m^i = w_{m,\tau}^i - w_{m,0}^i$.   (4)

Then the local model update is transmitted to the server through the noisy uplink communication channel. Again, what the server receives is a noisy version of the local model update, denoted $(\Delta w_m^i)'$. Upon receiving the noisy local updates from the mobile devices, the server performs aggregation to obtain the updated global model:

$w_{m+1} = w_m + \sum_{i=1}^{n} p_i (\Delta w_m^i)'$.   (5)

The above process iterates until the algorithm converges.

III. BER TOLERANCE: UPLINK VS. DOWNLINK

In this section, we first analyze intuitively the difference in BER tolerance between the downlink and uplink. We then provide theoretical analyses that explain this difference and quantify the BER tolerance in terms of system parameters.

A. Intuitive Analysis

Intuitively, we expect the uplink to tolerate a higher BER than the downlink, for two reasons. First, for a given BER in the downlink, the local training steps on a noisy model may propagate the errors. For the uplink, in contrast, the effect of the given BER may be alleviated by the aggregation operation. As shown in Fig. 2, due to randomness, the error bits at different clients are likely to appear in different DNN positions. When the updates are aggregated at the server, the error value appearing at a given position is averaged over the number of clients. Therefore, the error is reduced in the uplink.

Second, considering a typical FL system that broadcasts model weights in the downlink and transmits model updates in the uplink, the range of the model weights is usually larger than the range of the model updates. A bit error in a parameter with a larger range translates into a larger error in magnitude, so a given BER may have a more severe impact on the downlink.

B. Theoretical Analysis

To theoretically analyze the BER tolerance, we adopt the following basic assumptions [3], [18]:

Assumption 1. The loss function $f_i$ is $L$-smooth with respect to $w$, i.e., for any $w, \hat{w} \in \mathbb{R}^d$, we have $\|\nabla f_i(w) - \nabla f_i(\hat{w})\| \le L \|w - \hat{w}\|$.

Assumption 2. The stochastic gradients $\tilde{\nabla} f_i(w)$ are unbiased and variance-bounded, i.e., $E_{\xi}[\tilde{\nabla} f_i(w)] = \nabla f_i(w)$ and $E_{\xi}[\|\tilde{\nabla} f_i(w) - \nabla f_i(w)\|^2] \le \sigma^2$, where $\xi$ denotes the mini-batch dataset and $\sigma^2$ denotes the variance.

Using the above assumptions, we can provide the convergence bound for wireless FL with a given BER in the uplink or downlink. To investigate the extreme BER tolerance for one given link, we assume no error on the other link.
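The bit-level error model of Section II-B, and the averaging effect argued in the intuitive analysis, can be illustrated with a short simulation. This is a sketch under our own assumptions (i.i.d. bit flips on IEEE-754 float32 parameters); `flip_bits` is our helper, not from the paper.

```python
import numpy as np

def flip_bits(params, ber, rng):
    """Flip each bit of a float32 array independently with probability `ber`.

    Models the bit errors that a noisy digital channel introduces into the
    transmitted model parameters (cf. Section II-B).
    """
    out = params.astype(np.float32)
    raw = out.view(np.uint8)               # 4 bytes per parameter
    mask = rng.random(raw.size * 8) < ber  # one flip flag per bit
    raw ^= np.packbits(mask)               # XOR applies the flips in place
    return out

rng = np.random.default_rng(42)

# Uplink: each of n clients sends the same true update through its own noisy
# channel; the server averages the n corrupted copies (Eq. (5)).
true_update = np.full(1000, 0.01, dtype=np.float32)
for n in (1, 10, 100):
    received = [flip_bits(true_update, ber=1e-3, rng=rng) for _ in range(n)]
    agg = np.mean(received, axis=0)
    err = np.nanmean(np.abs(agg - true_update))
    print(f"n={n:3d}  mean aggregation error ~ {err:.3g}")
# The residual error at any given position tends to shrink as n grows,
# matching the averaging argument of Section III-A.
```

Note that flipping exponent bits can blow a parameter up by many orders of magnitude, which is exactly why the numerical range of the transmitted parameters matters in the analysis below.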
Theorem 1. Consider the wireless FL system introduced in Section II with non-convex loss functions and a total of $K$ communication rounds. The convergence bound of FL with a downlink BER is

$\frac{1}{K\tau} \sum_{m=0}^{K-1} \sum_{t=0}^{\tau-1} E\|\nabla f(\bar{w}_{m,t})\|^2 \le \frac{2Ld}{3K\tau\eta} \sum_{m=0}^{K-1} \mathrm{BER}_m \cdot \mathrm{range}^2(w_m) + \frac{2(f(w_0) - f^*)}{K\tau\eta} + \frac{L^2(n+1)(\tau-1)\eta^2\sigma^2}{n} + \frac{L\eta\sigma^2}{n}$,   (6)

and the convergence bound of FL with an uplink BER is

$\frac{1}{K\tau} \sum_{m=0}^{K-1} \sum_{t=0}^{\tau-1} E\|\nabla f(\bar{w}_{m,t})\|^2 \le \frac{Ld}{3n^2 K\tau\eta} \sum_{m=0}^{K-1} \sum_{i\in[n]} \mathrm{BER}_m^i \cdot \mathrm{range}^2(\Delta w_m^i) + \frac{2(f(w_0) - f^*)}{K\tau\eta} + \frac{L^2(n+1)(\tau-1)\eta^2\sigma^2}{n} + \frac{L\eta\sigma^2}{n}$,   (7)

where $\mathrm{range}(\cdot)$ is the maximum value in the model minus the minimum value; $f^*$ denotes the final converged training loss; and $\mathrm{BER}_m$ and $\mathrm{BER}_m^i$ are the BERs in the downlink and uplink communication, respectively.

Proof. See Appendix.

On the right-hand side (RHS) of Eq. (6) or Eq. (7), the first term captures the impact of the downlink or uplink BER. The second term captures the distance between the initial and the final training loss. The third term reflects the deviation of each local model from the overall average model, and the fourth term comes from the variance of the local gradient estimator. It is worth noting that the only difference between downlink-noisy and uplink-noisy FL lies in the first term.

Remark 1 (The effect of the number of clients). The impact of BER on FL convergence can be examined through the first term on the RHS of Eq. (6) and Eq. (7). We observe that for the uplink, the impact of the BER is related to $n$, i.e., $\frac{1}{n^2}\sum_{i\in[n]}$. By the law of large numbers, the overall effect is close to the impact of the uplink BER divided by $n$. This observation is consistent with our intuitive analysis that the impact of the uplink BER can be reduced by averaging. In contrast, the impact of the downlink BER has no relation to $n$.

Fig. 3: How the learning performance with downlink or uplink bit errors is affected by the number of clients. (a) FL with downlink errors: increasing the number of clients does not help training. (b) FL with uplink errors: increasing the number of clients improves training.

Remark 2 (Quantifying the difference in BER tolerance). To quantify the difference in BER tolerance between the uplink and downlink, we can derive the required BERs for a given learning performance. Setting the RHS of Eq. (6) equal to the RHS of Eq. (7), we get

$\mathrm{BER}_m = \frac{\mathrm{BER}_m^i \, \mathrm{range}^2(\Delta w_m^i)}{2n \, \mathrm{range}^2(w_m)}$.   (8)

This indicates that the difference in BER tolerance between the downlink and uplink can be quantified by a function of the number of clients $n$ and the range of the communicated parameters, which is also consistent with our intuitive analysis that the range of the transmitted parameters matters. If the model weights rather than the model updates are transmitted in the uplink, Eq. (8) becomes

$\mathrm{BER}_m = \frac{\mathrm{BER}_m^i \, \mathrm{range}^2(w_m^i)}{2n \, \mathrm{range}^2(w_m)}$.   (9)

From the theory, the uplink can tolerate a larger BER than the downlink, and its superiority can be quantified by the formula in Eq. (8) or Eq. (9), depending on whether model updates or model weights are transmitted in the uplink. These analyses can be utilized in the design of wireless FL. For example, the clients can reduce the uplink transmitting power while the server increases the downlink transmitting power to meet the different BER requirements, and Eq. (8) and Eq. (9) can be used to estimate the required power.

Fig. 4: The range of model parameters for the MNIST experiment. (a) Range of downlink w. (b) Range of uplink ∆w.

Fig. 5: MNIST experiment: the downlink tolerates a BER of 10^−4 while the uplink can tolerate 10^−1. (a) FL with downlink BER. (b) FL with uplink BER.
Fig. 6: The range of model parameters in Fashion-MNIST. (a) Range of downlink w. (b) Range of uplink ∆w.

Fig. 8: For different communication rounds, the range of model weights in the uplink is always similar to that in the downlink. (a) MNIST. (b) Fashion-MNIST.