
Handling Both Stragglers and Adversaries for Robust Federated Learning

Jungwuk Park * 1 Dong-Jun Han * 1 Minseok Choi 2 Jaekyun Moon 1

* Equal contributions. 1 School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), South Korea. 2 Department of Telecommunication Engineering, Jeju National University, South Korea.

This work was presented at the International Workshop on Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML 2021 (FL-ICML'21). Copyright 2021 by the author(s).

Abstract

While federated learning (FL) allows efficient model training with local data at edge devices, among the major issues still to be resolved are: slow devices known as stragglers, and malicious attacks launched by adversaries. While the presence of both of these issues raises serious concerns in practical FL systems, no known schemes or combinations of schemes effectively address them at the same time. We propose Sageflow, staleness-aware grouping with entropy-based filtering and loss-weighted averaging, to handle both stragglers and adversaries simultaneously. Model grouping and weighting according to staleness (arrival delay) provides robustness against stragglers, while entropy-based filtering and loss-weighted averaging, working in a complementary fashion at each grouping stage, counter a wide range of adversary attacks. Extensive experimental results show that Sageflow outperforms various existing methods aiming to handle stragglers/adversaries.

1. Introduction

Federated learning (FL) (McMahan et al., 2017; Konečnỳ et al., 2016) is a promising direction for large-scale learning, which enables training of a global model with fewer privacy concerns. However, among the major issues that need to be addressed in current FL systems are the devices called stragglers, which are considerably slower than the average, and the adversaries, which launch various forms of attacks.

Regarding the first issue, simply waiting for all the stragglers can significantly slow down the training process of FL. To handle stragglers, various asynchronous FL schemes have been proposed in the literature (Li et al., 2019b; Xie et al., 2019a; van Dijk et al., 2020). Especially in FedAsync (Xie et al., 2019a), the global model is updated asynchronously according to the device's staleness t − τ, the difference between the current round t and the past round τ at which the device first received the global model. While the asynchronous schemes are highly effective in handling stragglers, their one-by-one update rule does not lend itself well to integration with established aggregation methods that combat the second issue, the adversaries.

There are different forms of adversarial attacks that degrade the performance of current FL systems: untargeted attacks such as model-update poisoning (Blanchard et al., 2017) or data poisoning (Biggio et al., 2012; Liu et al., 2017), and targeted attacks (or backdoor attacks) (Chen et al., 2017a; Bagdasaryan et al., 2018). Robust federated averaging (RFA) of (Pillutla et al., 2019), a well-known method proposed to handle adversaries in FL, employs geometric-median-based aggregation to provide a fair level of protection. Other aggregation schemes (e.g., Multi-Krum) also successfully handle adversaries in distributed learning (Blanchard et al., 2017; Chen et al., 2017b). Unfortunately, however, the performance of these methods degrades substantially when the portion of adversaries is large. The presence of stragglers drives the attack ratio higher (e.g., when stragglers are ignored), significantly degrading the performance of current aggregation schemes.

While the presence of both stragglers and adversaries raises significant concerns in current FL, to our knowledge, there are currently no existing methods or combinations of ideas that can effectively handle these issues simultaneously.

Main contributions. We propose Sageflow, staleness-aware grouping with entropy-based filtering and loss-weighted averaging, a robust FL strategy which can handle both stragglers and adversaries at the same time. Targeting the straggler issue, our strategy is to perform periodic global aggregation while allowing the results sent from stragglers to be aggregated in later rounds. In each global round, we take advantage of the results sent from stragglers by first grouping the models that come from the same initial model (i.e., the same staleness), to obtain a group representative model. Then, we aggregate the representative models of all groups based on their staleness, to obtain the global model. Our grouping strategy based on staleness is not only effective in neutralizing stragglers but also provides a great platform for countering adversaries, as discussed below.

Targeting each grouping stage of our straggler-mitigating idea, we propose an intra-group defense strategy based on our entropy-based filtering and loss-weighted averaging. These two methods work in a highly complementary fashion to effectively counter a wide range of adversarial attacks in each grouping stage. Here, in computing the entropy and loss of each received model, we utilize public data that we assume to be available at the server. In fact, the utilization of public data is not a new idea, as seen in recent FL setups of (Zhao et al., 2018; Xie et al., 2019c; Li et al., 2020). This is generally a reasonable setup since data centers typically have some collected data that can be accessed by the public. We show later via experiments that only a very small amount of public data is necessary at the server (2% of the entire dataset, which is comparable to the amount of local data at a single device) to successfully combat adversaries. Our main contributions are as follows:

• We propose Sageflow, handling both stragglers and adversaries simultaneously in FL, via a novel staleness-aware grouping combined with entropy-based filtering and loss-weighted averaging.

• We derive a theoretical bound for Sageflow and provide key insights into its convergence behavior.

• Experimental results show that Sageflow outperforms various combinations of straggler/adversary defense methods using only a small portion of public data.

Related works. The authors of (Li et al., 2019b; Wu et al., 2019; Xie et al., 2019a; Lu et al., 2020; van Dijk et al., 2020) have recently tackled the straggler issue in FL. The basic idea is to allow the devices and the server to update the models asynchronously; the global model is updated every time the server receives a model from a device. However, a critical issue is that grouping-based (or aggregation-based) methods designed to handle adversaries, such as RFA (Pillutla et al., 2019), Multi-Krum (Blanchard et al., 2017) or the presently proposed entropy/loss based idea, are hard to implement in conjunction with these schemes, since the model update is performed one-by-one asynchronously. Compared to the asynchronous schemes, our staleness-aware grouping can be combined smoothly with various aggregation rules against adversaries; we can apply RFA, Multi-Krum or our entropy/loss based idea in each grouping stage to obtain the group representative model.

To combat adversaries, various aggregation methods have been proposed in a distributed learning setup with IID data across nodes (Yin et al., 2018a;b; Chen et al., 2017b; Blanchard et al., 2017). The authors of (Chen et al., 2017b) suggest a geometric-median-based aggregation rule for models or gradients. In Multi-Krum (Blanchard et al., 2017), among N workers in the system, the server tolerates f Byzantine workers, where 2f + 2 < N. Targeting FL with non-IID (independent, identically distributed) data distribution, RFA of (Pillutla et al., 2019) utilizes the geometric median of the models sent from devices, similar to (Chen et al., 2017b). However, as already implied, these methods are ineffective when combined with a straggler-mitigation scheme (e.g., ignoring stragglers), potentially degrading the performance of learning. Compared to Multi-Krum and RFA, our entropy/loss based scheme can tolerate adversaries even with a high attack ratio, showing remarkable advantages when combined with straggler-mitigation schemes.

Finally, we note that Zeno (Xie et al., 2019b) and Zeno+ (Xie et al., 2019c) also utilize public data at the server to handle adversaries, but in a distributed learning setup with IID data across the nodes. Compared to Zeno+, our Sageflow targets non-IID data distributions in an FL setup. While Zeno+ adopts a fully asynchronous update rule (without considering the staleness) with a loss-based score function, our Sageflow integrates staleness-aware grouping, a semi-synchronous straggler-handling method, with entropy filtering and loss-weighted averaging, a harmonized means to provide protection against a wider variety of attacks.

2. Proposed Sageflow for Federated Learning

We consider the following federated optimization problem:

w^* = \arg\min_w F(w) = \arg\min_w \sum_{k=1}^{N} \frac{m_k}{m} F_k(w),   (1)

where N is the number of devices, m_k is the number of data samples in device k, and m = \sum_{k=1}^{N} m_k. Letting x_{k,j} be the j-th data sample in device k, the local loss function of device k, F_k(w), is written as F_k(w) = \frac{1}{m_k} \sum_{j=1}^{m_k} \ell(w; x_{k,j}).
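As a quick illustration, the objective in (1) is a sample-size-weighted average of the local losses. The minimal Python sketch below (function and variable names are ours, not from the paper) evaluates F(w) given per-device losses and sample counts:

```python
import numpy as np

def global_loss(local_losses, sample_counts):
    """Weighted global objective F(w) = sum_k (m_k / m) * F_k(w) from Eq. (1).

    local_losses[k]  : F_k(w), loss of the shared model w on device k's data
    sample_counts[k] : m_k, number of samples on device k
    """
    m = float(np.sum(sample_counts))
    return float(np.sum(np.asarray(sample_counts) / m * np.asarray(local_losses)))

# Example: three devices with different data sizes.
print(global_loss(local_losses=[0.9, 1.2, 0.7], sample_counts=[100, 300, 200]))
```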
2.1. Staleness-Aware Grouping against Stragglers

In the t-th global round, the server sends the current model w_t to K devices in S_t, a randomly selected subset of the N devices. We let C = K/N be the ratio of devices that participate in each global round. Each device in S_t performs E local updates and sends the updated model back to the server.

In handling slow devices, our idea assumes periodic global aggregation at the server. At each global round t, the server transmits the current model and time stamp, (w_t, t), to the devices in S_t. Instead of waiting for all devices in S_t, the server aggregates the models that arrive by a fixed time deadline T_d to obtain w_{t+1}, and moves on to the next global round t + 1. Hence, model aggregation is performed periodically with every T_d. A key feature here is that we do not ignore the results sent from stragglers not arriving by the deadline T_d. These results are utilized at the next global aggregation step, or even later, depending on the delay or staleness.

[Figure 1: schematic of the Sageflow server. Models received at global round t are grouped by the round at which they started (round 0, ..., t−1, t); each group passes through entropy-based filtering and loss-weighted averaging, and the resulting group representatives are averaged according to staleness.]

Figure 1. Overall procedure of Sageflow. At global round t, each received model belongs to one of the t + 1 sets: \mathcal{U}_t^{(0)}, \mathcal{U}_t^{(1)}, ..., \mathcal{U}_t^{(t)}. After entropy-based filtering, the server performs loss-weighted averaging for the results that belong to \mathcal{U}_t^{(i)}(E_{th}) to obtain v_{t+1}^{(i)}. Then we aggregate \{v_{t+1}^{(i)}\}_{i=0}^{t} with w_t to obtain w_{t+1}, and move on to the next round t + 1.

Let \mathcal{U}_i^{(t)} be the group of devices selected by the server at global round t that successfully send their results to the server at global round i, for i ≥ t. Then, we can write S_t = \bigcup_{i=t}^{\infty} \mathcal{U}_i^{(t)}, where \mathcal{U}_i^{(t)} \cap \mathcal{U}_j^{(t)} = \emptyset for i ≠ j. Here, \mathcal{U}_{\infty}^{(t)} can be viewed as the devices that are selected at round t but fail to successfully send their results back to the server. According to these notations, the devices whose training results arrive at the server during global round t belong to one of the t + 1 groups: \mathcal{U}_t^{(0)}, \mathcal{U}_t^{(1)}, ..., \mathcal{U}_t^{(t)}. Note that the result sent from device k \in \mathcal{U}_t^{(i)} is the model after E local updates starting from w_i, and we denote this model by w_i(k). At each round t, we first perform FedAvg as

v_{t+1}^{(i)} = \sum_{k \in \mathcal{U}_t^{(i)}} \frac{m_k}{\sum_{k' \in \mathcal{U}_t^{(i)}} m_{k'}} w_i(k)   (2)

for all i = 0, 1, ..., t. Here, v_{t+1}^{(i)} can be viewed as a group representative model, which is the aggregated result of the locally updated models (starting from w_i) received at round t with staleness t − i + 1. Then, from the representative models of all t + 1 groups, v_{t+1}^{(0)}, v_{t+1}^{(1)}, ..., v_{t+1}^{(t)}, we take a weighted average according to the different staleness values: \sum_{i=0}^{t} \alpha_t^{(i)}(\lambda) v_{t+1}^{(i)}. Here, \alpha_t^{(i)}(\lambda) \propto \frac{\sum_{k \in \mathcal{U}_t^{(i)}} m_k}{(t - i + 1)^{\lambda}} is a normalized coefficient that is inversely proportional to (t − i + 1)^{\lambda}, for a given staleness exponent λ ≥ 0. Hence, we have a larger weight for v_{t+1}^{(i)} with a smaller staleness t − i + 1. This gives more weight to more recent results having smaller staleness. Based on the weighted sum \sum_{i=0}^{t} \alpha_t^{(i)}(\lambda) v_{t+1}^{(i)}, we finally obtain w_{t+1} as

w_{t+1} = (1 - \gamma) w_t + \gamma \sum_{i=0}^{t} \alpha_t^{(i)}(\lambda) v_{t+1}^{(i)},   (3)

where γ is the time-average coefficient. Now we move on to the next round t + 1, where the server selects S_{t+1} and sends (w_{t+1}, t + 1) to these devices. Here, if the server knows the set of active devices (which are still performing computation), S_{t+1} can be constructed to be disjoint from the active devices. If not, the server randomly chooses S_{t+1} among all devices, and the selected active devices can ignore the current request of the server. The left-hand side of Fig. 1 describes our staleness-aware grouping method.
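A minimal sketch of the grouping rules (2)-(3) is given below. This is an illustrative NumPy version that treats each model as a flat parameter vector; the function names and interfaces are ours and are not part of the paper.

```python
import numpy as np

def group_representative(models, sample_counts):
    """Eq. (2): FedAvg within one staleness group (all models started from the same w_i)."""
    weights = np.asarray(sample_counts, dtype=float)
    weights /= weights.sum()
    return np.average(np.stack(models), axis=0, weights=weights)

def staleness_weights(group_sizes, t, lam):
    """alpha_t^{(i)}(lambda) ∝ (total samples in group i) / (t - i + 1)^lambda, i = 0..t."""
    raw = np.array([group_sizes[i] / (t - i + 1) ** lam for i in range(t + 1)])
    return raw / raw.sum()

def aggregate_round(w_t, representatives, group_sizes, t, lam, gamma):
    """Eq. (3): w_{t+1} = (1 - gamma) * w_t + gamma * sum_i alpha_t^{(i)} * v_{t+1}^{(i)}."""
    alpha = staleness_weights(group_sizes, t, lam)
    z = sum(a * v for a, v in zip(alpha, representatives))
    return (1.0 - gamma) * w_t + gamma * z
```

A larger staleness exponent lam pushes the weight toward the freshest group, while gamma controls how far the global model moves per round, matching the roles of λ and γ in the text.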
2.2. Entropy-based Filtering and Loss-Weighted Averaging against Adversaries

Now we propose a solution against adversaries which not only shows better performance under attacks, but also combines well with our staleness-aware grouping scheme, compared to existing aggregation methods for handling adversaries. We provide the following two solutions, which protect the system in a highly complementary fashion against various attacks using only a small amount of public data.

1) Entropy-based filtering. Let n_{pub} be the number of public data samples in the server. We also let x_{pub,j} be the j-th public sample in the server. When the server receives the locally updated models from the devices, it measures the entropy of each device k by utilizing the public data as

E(k) = \frac{1}{n_{pub}} \sum_{j=1}^{n_{pub}} E_{x_{pub,j}}(k),   (4)

where E_{x_{pub,j}}(k) is the Shannon entropy of the model of the k-th device on the sample x_{pub,j}, written as E_{x_{pub,j}}(k) = -\sum_{q=1}^{Q} P_{x_{pub,j}}^{(q)}(k) \log P_{x_{pub,j}}^{(q)}(k). Here, Q is the number of classes of the dataset and P_{x_{pub,j}}^{(q)}(k) is the predicted probability of the q-th class on a sample x_{pub,j}, using the model of the k-th device. As can be seen from Fig. 2(a) with the FMNIST dataset, under specific model-update poisoning attacks (a reverse sign attack with scale 0.1 in this case), the models compromised by adversaries tend to predict randomly across all classes and thus have high entropies compared to the models of benign devices. Based on this observation, we let the server filter out the models whose entropies are greater than some threshold value E_{th}. We note that the inflicted models cannot be filtered out based on the loss values in the model-update poisoning setting of Fig. 2(b), which confirms the importance of the role of entropy.
Handling Both Stragglers and Adversaries for Robust Federated Learning

[Figure 2: entropy and loss of adversarial versus benign devices over global rounds, shown in four panels: (a) model poison, entropy; (b) model poison, loss; (c) data poison, entropy; (d) data poison, loss.]

Figure 2. Model-update poisoning ((a), (b)): we can filter out the models of adversaries via entropy, but not via loss. Data poisoning ((c), (d)): we can reduce the impact of adversaries via loss, but not via entropy.

We also note that our entropy-based filtering is robust against attacks even with a large portion of adversaries, since it simply filters out the results whose entropy is greater than E_{th}. This is a significant advantage over current aggregation methods against adversaries, whose performance is substantially degraded under a high attack ratio.
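A minimal sketch of the entropy measure in (4) and the resulting filter is shown below. It assumes each device's model is exposed as a callable returning class probabilities for one public sample; this interface is our simplification for illustration, not part of the paper.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of one softmax output (the inner term of Eq. (4))."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def average_entropy(predict_fn, public_samples):
    """E(k): entropy of one device's model averaged over the public samples."""
    return float(np.mean([predictive_entropy(predict_fn(x)) for x in public_samples]))

def entropy_filter(models, predict_fns, public_samples, e_th):
    """Keep only the models whose average entropy is below the threshold E_th."""
    kept = []
    for w, predict_fn in zip(models, predict_fns):
        if average_entropy(predict_fn, public_samples) < e_th:
            kept.append(w)
    return kept
```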
2) Loss-weighted averaging. The server also measures the loss of each model using the public data. Based on the loss values, the server aggregates the models as follows:

w_{t+1} = \sum_{k \in S_t} \beta_t^{(k)}(\delta) w_t(k),   (5)

where \beta_t^{(k)}(\delta) \propto \frac{m_k}{\{F_{pub}(w_t(k))\}^{\delta}} and \sum_{k \in S_t} \beta_t^{(k)}(\delta) = 1.   (6)

Here, w_t(k) is the locally updated model of the k-th device at global round t. F_{pub}(w_t(k)) is defined as the average loss of w_t(k) computed on the public data at the server, i.e., F_{pub}(w_t(k)) = \frac{1}{n_{pub}} \sum_{j=1}^{n_{pub}} \ell(w_t(k); x_{pub,j}). Finally, δ (≥ 0) in \{F_{pub}(\cdot)\}^{\delta} is a loss exponent controlling the impact of the loss measured with public data. We note that setting δ = 0 in (5) reduces our loss-weighted averaging to FedAvg.

Under the data poisoning or model replacement backdoor attacks (or scaled backdoor attacks) of (Bagdasaryan et al., 2018), the models of adversaries have relatively larger losses compared to others. Fig. 2(d) shows an example with the FMNIST dataset under a data poisoning attack. By the definition of \beta_t^{(k)}(\delta), a data-poisoned model is given a small weight and has less impact on the next global update. By replacing FedAvg with the above loss-weighted averaging, we are able to build a system that is highly robust against data poisoning and scaled backdoor attacks. As can be seen from Fig. 2(c), the impact of data poisoning cannot be reduced via entropy measures. This indicates that the loss measure has its own unique role, along with the entropy.

Entropy-based filtering and loss-weighted averaging can be easily combined, and work complementarily to tackle model/data poisoning and scaled backdoor attacks.
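The loss-weighted averaging rule in (5)-(6) can be sketched as follows, assuming the per-model public losses F_pub(w_t(k)) have already been computed. The function name and interface are illustrative rather than the authors' implementation.

```python
import numpy as np

def loss_weighted_average(models, sample_counts, public_losses, delta):
    """Eqs. (5)-(6): beta_k ∝ m_k / F_pub(w(k))^delta, then a weighted model average.

    public_losses[k] : F_pub(w(k)), average loss of device k's model on the public data.
    delta = 0 recovers plain FedAvg weights m_k / m.
    """
    m = np.asarray(sample_counts, dtype=float)
    f_pub = np.asarray(public_losses, dtype=float)
    beta = m / np.power(f_pub, delta)
    beta /= beta.sum()
    return np.average(np.stack(models), axis=0, weights=beta)
```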
2.3. Sageflow

Finally, we put together Sageflow, which handles both stragglers and adversaries at the same time by applying entropy-based filtering and loss-weighted averaging in each grouping stage of our straggler-mitigating staleness-aware aggregation. The details of the overall Sageflow operation are described in Algorithm 1 and Fig. 1.

We stress that Sageflow performs well with only a very small amount of public data at the server, 2% of the entire dataset, which is comparable to the amount of local data at a single device. The computational complexity of Sageflow depends on the number of received models at each global round and on the running time for computing the entropy/loss of each model. Assuming that the complexity of computing the entropy or loss is linear in the number of model parameters, as in (Xie et al., 2019b), Sageflow has larger complexity than RFA by a factor of n_{pub}. This small additional computation of Sageflow compared to RFA is the cost of considerably better robustness against adversaries.

Algorithm 1 Proposed Sageflow Algorithm

Input: Initialized model w_0. Output: Final global model w_T.
Process at the Server
1: for each global round t = 0, 1, ..., T − 1 do
2:   Choose S_t and send the current model and the global round (w_t, t) to the devices
3:   Wait for T_d and then:
4:   for i = 0, 1, ..., t do
5:     \mathcal{U}_t^{(i)}(E_{th}) = \{k \in \mathcal{U}_t^{(i)} \mid E(k) < E_{th}\}   // Entropy-based filtering in each group
6:     v_{t+1}^{(i)} = \sum_{k \in \mathcal{U}_t^{(i)}(E_{th})} \beta_t^{(k)}(\delta) w_i(k)   // Loss-weighted averaging in each group (with the same staleness)
7:   end for
8:   w_{t+1} = (1 − \gamma) w_t + \gamma \sum_{i=0}^{t} \alpha_t^{(i)}(\lambda) v_{t+1}^{(i)}   // Averaging of representative models (with different staleness)
9: end for
Process at the Device: Device k receives (w_t, t) from the server and performs local updates to obtain w_t(k). Then each benign device k sends (w_t(k), t) to the server, while a malicious adversary sends a poisoned model depending on the type of attack.
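Putting the pieces together, one server aggregation step (Algorithm 1, lines 4-8) might look like the following sketch. The data layout, per-group tuples of (model, sample count, entropy, public loss) with entropies and public losses assumed to be precomputed on the public data, is our illustrative assumption.

```python
import numpy as np

def sageflow_round(w_t, groups, gamma, lam, delta, e_th):
    """One Sageflow server aggregation, sketched.

    groups[i] : list of (model, m_k, entropy, public_loss) for models received this
                round that started from w_i, i.e., with staleness t - i + 1.
    """
    t = len(groups) - 1
    representatives, raw_alpha = [], []
    for i, group in enumerate(groups):
        # Entropy-based filtering within the group (Algorithm 1, line 5).
        kept = [(w, m, loss) for (w, m, ent, loss) in group if ent < e_th]
        if not kept:
            continue
        # Loss-weighted averaging within the group (Algorithm 1, line 6).
        beta = np.array([m / loss ** delta for (_, m, loss) in kept])
        beta /= beta.sum()
        representatives.append(sum(b * w for b, (w, _, _) in zip(beta, kept)))
        raw_alpha.append(sum(m for (_, m, _) in kept) / (t - i + 1) ** lam)
    if not representatives:
        return w_t  # nothing survived filtering; keep the current global model
    alpha = np.array(raw_alpha) / np.sum(raw_alpha)
    z = sum(a * v for a, v in zip(alpha, representatives))
    # Staleness-weighted averaging of representatives (Algorithm 1, line 8).
    return (1.0 - gamma) * w_t + gamma * z
```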

3. Convergence Analysis

In this section, we provide insights into the convergence of Sageflow under the following standard assumptions in FL (Li et al., 2019a; Xie et al., 2019a).

Assumption 1 The global loss function F defined in (1) is μ-strongly convex, i.e., F(x) \le F(y) + \nabla F(x)^T (x - y) - \frac{\mu}{2} \|x - y\|^2 for all x, y. Moreover, F is L-smooth, i.e., F(x) \ge F(y) + \nabla F(x)^T (x - y) - \frac{L}{2} \|x - y\|^2 for all x, y.

Assumption 2 Let w_t^i(k) be the model of the k-th benign device after i local updates starting from global round t. Let \xi_t^i(k) be a set of data samples randomly selected from device k for the (i + 1)-th local update. Then, \mathbb{E}\|\nabla F_k(w_t^i(k), \xi_t^i(k)) - \nabla F(w_t^i(k))\|^2 \le \rho_1 holds for all t, for k = 1, ..., N, and for i = 1, ..., E.

Let B_t^{(i)} and M_t^{(i)} be the sets of benign and adversarial devices of \mathcal{U}_t^{(i)}, respectively, satisfying \mathcal{U}_t^{(i)} = B_t^{(i)} \cup M_t^{(i)} and B_t^{(i)} \cap M_t^{(i)} = \emptyset. Similarly, define B_t^{(i)}(E_{th}) and M_t^{(i)}(E_{th}) as the sets obtained after entropy-based filtering is applied to B_t^{(i)} and M_t^{(i)}. Now we have the following definition, which describes the effect of the adversaries.

Definition 1 We define \Omega_t^{(i)}(E_{th}, \delta) as

\Omega_t^{(i)}(E_{th}, \delta) = \sum_{k \in M_t^{(i)}(E_{th})} \beta_i^{(k)}(\delta) [F(w_t(k)) - F(w^*)].   (7)

Now we state the following theorem, which provides the convergence bound of Sageflow.

Theorem 1 Suppose Assumptions 1 and 2 hold and the learning rate η is set to be less than 1/L. Then Sageflow satisfies

\mathbb{E}[F(w_T) - F(w^*)] \le \nu^T [F(w_0) - F(w^*)] + (1 - \nu^T) Z(\lambda, E_{th}, \delta),   (8)

where \nu = 1 - \gamma + \gamma (1 - \eta\mu)^E,

Z(\lambda, E_{th}, \delta) = \frac{\rho_1 + 2\mu G_{max}(\lambda) + 2\mu \Omega_{max}(E_{th}, \delta)}{2\eta\mu^2},   (9)

G_{max}(\lambda) = \max_{1 \le t \le T} \sum_{i=0}^{t-1} \alpha_t^{(i)}(\lambda) e_t^{(i)}, \Omega_{max}(E_{th}, \delta) = \max_{t, i \le t} \Omega_t^{(i)}(E_{th}, \delta), and e_t^{(i)} = F(w_i) - F(w_t).

In (8) above, we see a trade-off between ν^T, which determines the convergence speed, and (1 − ν^T)Z, which represents an error. If we increase γ, we have a higher convergence speed (i.e., a smaller ν^T) but a larger error term. Our scheme allows separate control of the error term Z(λ, E_{th}, δ) through the staleness exponent λ, the entropy-filtering threshold E_{th} and the loss-weighted exponent δ. Here, in Z(λ, E_{th}, δ) of (9), G_{max}(λ) is the error term caused by stragglers, controlled by λ, and Ω_{max}(E_{th}, δ) is the error caused by adversaries, controlled by E_{th} and δ. First, to gain insight into G_{max}(λ), assume that the loss of the global model decreases as the global round increases, i.e., F(w_i) < F(w_j) for i > j. Then, we have e_t^{(0)} > e_t^{(1)} > ... > e_t^{(t-1)} > 0. In order to reduce \sum_{i=0}^{t-1} \alpha_t^{(i)}(\lambda) e_t^{(i)}, we have to increase the staleness exponent λ, increasing \alpha_t^{(i)}(\lambda) for large i (the weight for the group with small staleness) while reducing \alpha_t^{(i)}(\lambda) for small i (the weight for the group with large staleness). By choosing an appropriate λ, we can control the error term G_{max} while still taking advantage of the results sent from stragglers. As for the adversary-induced error Ω_{max}(E_{th}, δ), by imposing a threshold E_{th}, we can reduce the portion of adversaries (with high entropies) in each group \mathcal{U}_t^{(i)}(E_{th}), which in turn reduces \Omega_t^{(i)} according to (7). Likewise, by introducing δ, we can reduce the loss weights \beta_i^{(k)}(\delta) (defined in (6)) of the adversaries with large losses, which again reduces \Omega_t^{(i)} according to (7). In the next section, we show via experiments that Sageflow in fact successfully combats both stragglers and adversaries simultaneously and achieves fast convergence with a small error term.
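To make the trade-off concrete, one can plug hand-picked values into the bound of Theorem 1. The snippet below is purely illustrative: all constants, including G_max and Omega_max, are assumed numbers rather than measured quantities.

```python
# Illustrative evaluation of the contraction factor nu and the bound (8).
eta, mu, E, gamma = 0.01, 1.0, 5, 0.5
rho1, G_max, Omega_max = 1.0, 0.2, 0.1   # assumed error terms (not measured)
T, init_gap = 100, 2.0                   # rounds and assumed F(w_0) - F(w*)

nu = 1 - gamma + gamma * (1 - eta * mu) ** E
Z = (rho1 + 2 * mu * G_max + 2 * mu * Omega_max) / (2 * eta * mu ** 2)
bound = nu ** T * init_gap + (1 - nu ** T) * Z
print(f"nu = {nu:.4f}, Z = {Z:.2f}, bound after {T} rounds = {bound:.2f}")
```

Increasing gamma shrinks nu (faster decay of the first term) but leaves the residual error (1 - nu^T) Z larger, mirroring the discussion above.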
4. Experiments

We validate Sageflow on MNIST (LeCun et al., 1998), FMNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky et al., 2009). A simple convolutional neural network (CNN) with 2 convolutional layers and 2 fully connected layers is utilized for MNIST, while a CNN with 2 convolutional layers and 1 fully connected layer is used for FMNIST. When training with CIFAR10, we utilize VGG-11.

We consider N = 100 devices, each having the same number of data samples. We randomly assigned two classes to each device, as in (McMahan et al., 2017), to create non-IID situations. We ignored the batch normalization layers when training VGG-11 on CIFAR10. At each global round, we randomly selected a fraction C of devices in the system to participate. The portion of adversaries at each global round is r. For the proposed Sageflow method, we sample 2% of the entire training data uniformly at random to be the public data and perform FL with the remaining 98% of the training set. The number of local epochs at each device is set to 5.
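For concreteness, a common way to implement this two-classes-per-device split is a shard-based assignment in the spirit of McMahan et al. (2017), as sketched below. The exact shard construction is our assumption for illustration, not a detail given in the paper.

```python
import numpy as np

def two_class_partition(labels, num_devices=100, shards_per_device=2, seed=0):
    """Assign each device shards drawn from a label-sorted ordering, so that each
    device ends up with data from roughly two classes (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(np.asarray(labels))                    # indices sorted by label
    shards = np.array_split(order, num_devices * shards_per_device)
    perm = rng.permutation(len(shards))
    shards = [shards[p] for p in perm]                        # shuffle shard assignment
    return [np.concatenate(shards[i::num_devices]) for i in range(num_devices)]
```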
Comparison schemes. Under the existence of both stragglers and adversaries, we compare our Sageflow with various combinations of straggler/adversary-mitigating schemes. Targeting stragglers, we consider the following methods. First is the wait for stragglers approach, where FedAvg is applied after waiting for all the devices at each global round. The second scheme is the ignore stragglers approach, where FedAvg is applied after waiting for a certain timeout threshold, ignoring the results sent from slow devices.

[Figure 3: test accuracy versus running time under model-update poisoning on (a) MNIST, (b) FMNIST, (c) CIFAR10, comparing Sageflow (ours), FedAsync + eflow, Sag + RFA, ignore stragglers + RFA, wait for stragglers + RFA, wait for partial stragglers + RFA, and Zeno+.]

Figure 3. Model-update poisoning attack: Sageflow outperforms various combinations aiming to handle stragglers/adversaries.
Table 1. Performance at a specific time in Fig. 3.

                                     Model update poisoning        Data poisoning                Scaled backdoor attack
                                     (Test accuracy)               (Test accuracy)               (Attack success rate)
Methods \ Datasets                   MNIST    FMNIST   CIFAR10     MNIST    FMNIST   CIFAR10     MNIST    FMNIST   CIFAR10
Zeno+                                9.96%    9.95%    9.21%       12.27%   10.00%   10.00%      -        -        -
Ignore stragglers + RFA              10.04%   10.00%   10.00%      97.38%   80.78%   64.51%      -        -        -
Wait for partial stragglers + RFA    78.29%   35.29%   10.00%      97.59%   70.15%   62.13%      32.16%   98.49%   82.26%
Wait for stragglers + RFA            96.20%   70.01%   17.17%      96.60%   80.01%   58.51%      3.17%    42.79%   81.74%
Sag + RFA                            95.78%   73.64%   10.00%      97.60%   83.55%   64.18%      99.97%   99.97%   90.79%
FedAsync + eflow                     84.70%   66.38%   53.03%      83.47%   48.80%   51.28%      99.99%   99.86%   91.34%
Sageflow (Ours)                      97.27%   85.23%   63.76%      97.38%   85.01%   64.87%      0.21%    3.79%    5.56%

The third scheme is the wait for a percentage of stragglers approach, where FedAvg is applied after waiting for a specific portion of the selected devices in each global round; we consider a scheme that waits for 50% of the selected devices. Finally, we consider FedAsync (Xie et al., 2019a), where the global model is updated every time the result of a device arrives. Targeting adversaries, we consider both RFA (Pillutla et al., 2019) and an asynchronous version of Zeno+ (as in (Xie et al., 2019c)) which utilizes public data at the server: the server first subtracts each surviving model (after filtering) from the previous global model to obtain the model difference. Then, the global model is updated asynchronously based on each model difference. For a fair comparison, we let 2% of the training set be the public data and distribute the remaining 98% to the devices, as in our Sageflow.

Setup. The global aggregation at the server is performed periodically with T_d = 1 for Sageflow, ignore stragglers and FedAsync. To model stragglers, each device can have a delay of 0, 1, or 2 global rounds, determined independently and uniformly at random. For model-update poisoning, each adversary sends −0.1w to the server instead of sending the true model w. For the data poisoning attack, we conduct label flipping (Biggio et al., 2012), where each label i is flipped to label i + 1. For the backdoor, we use the model replacement method (scaled backdoor attack) (Bagdasaryan et al., 2018), in which adversaries transmit a scaled version of the corrupted model to replace the global model with a bad model. We conduct a pixel-pattern backdoor attack (Gu et al., 2017), in which specific pixels are embedded in a fraction of images and these images are classified as a targeted label. We applied the backdoor attack in every global round after the 10-th round for MNIST and FMNIST, and after the 1000-th round for CIFAR10.
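The two untargeted attacks used above are simple to emulate, as in the sketch below. The wrap-around of the last class label under flipping is our assumption for illustration.

```python
import numpy as np

def model_update_poison(w, scale=-0.1):
    """Model-update poisoning used in the experiments: send scale * w instead of w."""
    return scale * np.asarray(w)

def label_flip(labels, num_classes=10):
    """Data poisoning by label flipping: each label i becomes i + 1 (modulo the
    number of classes, which is our assumption for the last class)."""
    return (np.asarray(labels) + 1) % num_classes
```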
Results. Fig. 3 and Table 1 show the results under the existence of both stragglers and adversaries. We set C = 0.2, r = 0.2 for model/data poisoning and C = 0.1, r = 0.1 for the backdoor attack. For the backdoor attack, we excluded the results for Zeno+ and for RFA combined with ignore stragglers, since the models are not trained at all. We have the following observations. First, Zeno+ does not perform well since it does not take both the staleness and the entropy into account. It can also be seen that the wait for stragglers scheme combined with RFA suffers from the straggler issue. Our next observation is that RFA combined with the ignore stragglers method exhibits poor performance. The reason is that the attack ratio could often be very high (larger than r) for this deadline-based scheme, which degrades the performance of RFA. Due to the same issue, RFA combined with our staleness-aware grouping (Sag) suffers performance degradation. FedAsync does not perform well when combined with entropy-based filtering and loss-weighted averaging (eflow), since the model update is conducted one-by-one in the order of arrivals. Due to the same issue, FedAsync cannot be combined with RFA. Overall, the proposed Sageflow performs the best, confirming significant advantages of our scheme under the existence of both stragglers and adversaries.

5. Conclusion

We proposed Sageflow, a robust FL scheme handling both stragglers and adversaries simultaneously. The staleness-aware grouping allows the server to effectively utilize the results sent from stragglers. The grouping strategy also integrates naturally with defenses against adversaries. In each grouping stage of our straggler-mitigating idea, entropy-based filtering and loss-weighted averaging function in a complementary fashion to protect the system against a wide variety of adversarial attacks.

Theoretical convergence analysis provides key insights into why Sageflow works well. Experimental results show that Sageflow enables robust FL in practical scenarios with both stragglers and adversaries.

References

Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. arXiv preprint arXiv:1807.00459, 2018.

Biggio, B., Nelson, B., and Laskov, P. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.

Blanchard, P., Guerraoui, R., Stainer, J., et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pp. 119–129, 2017.

Chen, X., Liu, C., Li, B., Lu, K., and Song, D. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017a.

Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017b.

Gu, T., Dolan-Gavitt, B., and Garg, S. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.

Konečnỳ, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, Q., He, B., and Song, D. Model-agnostic round-optimal federated learning via knowledge transfer. arXiv preprint arXiv:2010.01017, 2020.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019a.

Li, Y., Yang, S., Ren, X., and Zhao, C. Asynchronous federated learning with differential privacy for edge intelligence. arXiv preprint arXiv:1912.07902, 2019b.

Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., and Zhang, X. Trojaning attack on neural networks. 2017.

Lu, X., Liao, Y., Lio, P., and Hui, P. Privacy-preserving asynchronous federated learning mechanism for edge network computing. IEEE Access, 8:48970–48981, 2020.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282, 2017.

Pillutla, K., Kakade, S. M., and Harchaoui, Z. Robust aggregation for federated learning. arXiv preprint arXiv:1912.13445, 2019.

van Dijk, M., Nguyen, N. V., Nguyen, T. N., Nguyen, L. M., Tran-Dinh, Q., and Nguyen, P. H. Asynchronous federated learning with reduced number of rounds and with differential privacy from less aggregated gaussian noise. arXiv preprint arXiv:2007.09208, 2020.

Wu, W., He, L., Lin, W., Jarvis, S., et al. Safa: a semi-asynchronous protocol for fast federated learning with low overhead. arXiv preprint arXiv:1910.01355, 2019.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Xie, C., Koyejo, S., and Gupta, I. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934, 2019a.

Xie, C., Koyejo, S., and Gupta, I. Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. In International Conference on Machine Learning, pp. 6893–6901. PMLR, 2019b.

Xie, C., Koyejo, S., and Gupta, I. Zeno++: Robust fully asynchronous sgd. arXiv preprint arXiv:1903.07020, 2019c.

Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018a.

Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. Defending against saddle point attack in byzantine-robust distributed learning. arXiv preprint arXiv:1806.05358, 2018b.

Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.

A. Proof of Theorem 1
A.1. Additional Notations for Proof
Let w_t^j(k) be the model of the k-th benign device after j local updates starting from global round t. At global round t, each device receives the current global model w_t and the round index (time stamp) t from the server, and sets its initial model to w_t, i.e., w_t^0(k) ← w_t for all k = 1, ..., N. Then each k-th benign device performs E local updates of stochastic gradient descent (SGD) with learning rate η:

w_t^j(k) ← w_t^{j-1}(k) − η \nabla F_k(w_t^{j-1}(k), \xi_t^{j-1}(k))  for j = 1, ..., E,   (10)

where \xi_t^j(k) is a set of data samples that are randomly selected from the k-th device during the j-th local update at global round t. After E local updates, the k-th benign device transmits w_t^E(k) to the server. However, in each round, the adversarial devices transmit poisoned model parameters.

Using these notations, the parameters defined in Section 2 can be rewritten as follows:

v_{t+1}^{(i)} = \sum_{k \in \mathcal{U}_t^{(i)}(E_{th})} \beta_i^{(k)}(\delta) w_i^E(k), where \beta_i^{(k)}(\delta) \propto \frac{m_k}{\{F_{pub}(w_i^E(k))\}^{\delta}} and \sum_{k \in \mathcal{U}_t^{(i)}(E_{th})} \beta_i^{(k)}(\delta) = 1   (11)

z_{t+1} = \sum_{i=0}^{t} \alpha_t^{(i)}(\lambda) v_{t+1}^{(i)}, where \alpha_t^{(i)}(\lambda) \propto \frac{\sum_{k \in \mathcal{U}_t^{(i)}} m_k}{(t - i + 1)^{\lambda}} and \sum_{i=0}^{t} \alpha_t^{(i)}(\lambda) = 1   (12)

w_{t+1} = (1 - \gamma) w_t + \gamma z_{t+1}   (13)

We also define

\Gamma_{t-1}^{(i)} = \sum_{k \in B_{t-1}^{(i)}(E_{th})} \beta_i^{(k)}(\delta),   (14)

where 0 ≤ \Gamma_{t-1}^{(i)} ≤ 1.

A.2. Key Lemma and Proof


We introduce the following key lemma for proving Theorem 1. A part of our proof is based on the convergence proof of
FedAsync in (Xie et al., 2019a).

Lemma 1 Suppose Assumptions 1 and 2 hold and the learning rate η is set to be less than 1/L. Consider the k-th benign device that received the current global model w_t from the server at global round t. After E local updates, the following holds:

\mathbb{E}[F(w_t^E(k)) - F(w^*) \mid w_t^0(k)] \le (1 - \eta\mu)^E [F(w_t^0(k)) - F(w^*)] + \frac{E\rho_1\eta}{2}.   (15)

Proof of Lemma 1. First, consider one step of SGD in the k-th local device. For a given w_t^j(k), for all global rounds t and for all local updates j ∈ {0, 1, ..., E − 1}, we have

\mathbb{E}[F(w_t^{j+1}(k)) - F(w^*) \mid w_t^j(k)]
\le F(w_t^j(k)) - F(w^*) - \eta \mathbb{E}[\nabla F(w_t^j(k))^T \nabla F_k(w_t^j(k), \xi_t^j(k)) \mid w_t^j(k)] + \frac{L\eta^2}{2} \mathbb{E}[\|\nabla F_k(w_t^j(k), \xi_t^j(k))\|^2 \mid w_t^j(k)]   (SGD update and L-smoothness)
\le F(w_t^j(k)) - F(w^*) + \frac{\eta}{2} \mathbb{E}[\|\nabla F(w_t^j(k)) - \nabla F_k(w_t^j(k), \xi_t^j(k))\|^2 \mid w_t^j(k)] - \frac{\eta}{2} \|\nabla F(w_t^j(k))\|^2   (\eta < 1/L)
\le F(w_t^j(k)) - F(w^*) - \frac{\eta}{2} \|\nabla F(w_t^j(k))\|^2 + \frac{\eta\rho_1}{2}   (Assumption 2)
\le (1 - \eta\mu)[F(w_t^j(k)) - F(w^*)] + \frac{\eta\rho_1}{2}   (\mu-strong convexity)   (16)

Applying the above result to the E local updates in the k-th local device, we have

\mathbb{E}[F(w_t^E(k)) - F(w^*) \mid w_t^0(k)]
= \mathbb{E}\big[\, \mathbb{E}[F(w_t^E(k)) - F(w^*) \mid w_t^{E-1}(k)] \,\big|\, w_t^0(k) \,\big]   (law of total expectation)
\le (1 - \eta\mu)\,\mathbb{E}[F(w_t^{E-1}(k)) - F(w^*) \mid w_t^0(k)] + \frac{\eta\rho_1}{2}   (inequality (16))
\vdots
\le (1 - \eta\mu)^E [F(w_t^0(k)) - F(w^*)] + \frac{\eta\rho_1}{2} \sum_{j=1}^{E} (1 - \eta\mu)^{j-1}
= (1 - \eta\mu)^E [F(w_t^0(k)) - F(w^*)] + \frac{\eta\rho_1}{2} \cdot \frac{1 - (1 - \eta\mu)^E}{\eta\mu}   (from \eta < 1/L \le 1/\mu, so \eta\mu < 1)
\le (1 - \eta\mu)^E [F(w_t^0(k)) - F(w^*)] + \frac{E\eta\rho_1}{2}   (from \eta\mu < 1 and 1 - (1 - \eta\mu)^E \le E\eta\mu)

A.3. Proof of Theorem 1


Now utilizing Lemma 1, we provide the proof of Theorem 1. First, consider one round of global aggregation at the server. For a given w_{t-1}, the server updates the global model according to equation (13). Then, for all t ∈ {1, ..., T}, we have

\mathbb{E}[F(w_t) - F(w^*) \mid w_{t-1}]
\overset{(a)}{\le} (1 - \gamma)[F(w_{t-1}) - F(w^*)] + \gamma \mathbb{E}[F(z_t) - F(w^*) \mid w_{t-1}]
\overset{(b)}{\le} (1 - \gamma)[F(w_{t-1}) - F(w^*)] + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}^{(i)}(\lambda) \mathbb{E}[F(v_t^{(i)}) - F(w^*) \mid w_{t-1}]
\overset{(c)}{\le} (1 - \gamma)[F(w_{t-1}) - F(w^*)] + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}^{(i)}(\lambda) \sum_{k \in \mathcal{U}_{t-1}^{(i)}(E_{th})} \beta_i^{(k)}(\delta) \mathbb{E}[F(w_i^E(k)) - F(w^*) \mid w_{t-1}]
= (1 - \gamma)[F(w_{t-1}) - F(w^*)] + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}^{(i)}(\lambda) \Big\{ \sum_{k \in B_{t-1}^{(i)}(E_{th})} \beta_i^{(k)}(\delta) \mathbb{E}[F(w_i^E(k)) - F(w^*) \mid w_{t-1}] + \sum_{k \in M_{t-1}^{(i)}(E_{th})} \beta_i^{(k)}(\delta) \mathbb{E}[F(w_i^E(k)) - F(w^*) \mid w_{t-1}] \Big\}
\overset{(d)}{\le} \big(1 - \gamma + \gamma \alpha_{t-1}^{(t-1)}(\lambda) \Gamma_{t-1}^{(t-1)} (1 - \eta\mu)^E\big)[F(w_{t-1}) - F(w^*)] + \frac{E\eta\rho_1\gamma}{2}
  + \gamma (1 - \eta\mu)^E \sum_{i=0}^{t-2} \alpha_{t-1}^{(i)}(\lambda) \sum_{k \in B_{t-1}^{(i)}(E_{th})} \beta_i^{(k)}(\delta) \big[F(w_i^0(k)) - F(w^*)\big]
  + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}^{(i)}(\lambda) \sum_{k \in M_{t-1}^{(i)}(E_{th})} \beta_i^{(k)}(\delta) \mathbb{E}[F(w_i^E(k)) - F(w^*) \mid w_{t-1}]
\overset{(e)}{\le} \big(1 - \gamma + \gamma \alpha_{t-1}^{(t-1)}(\lambda) \Gamma_{t-1}^{(t-1)} (1 - \eta\mu)^E\big)[F(w_{t-1}) - F(w^*)] + \frac{E\eta\rho_1\gamma}{2}
  + \gamma (1 - \eta\mu)^E \sum_{i=0}^{t-2} \alpha_{t-1}^{(i)}(\lambda) \sum_{k \in B_{t-1}^{(i)}(E_{th})} \beta_i^{(k)}(\delta) \underbrace{\big[F(w_i) - F(w^*)\big]}_{F(w_i) - F(w_{t-1}) + F(w_{t-1}) - F(w^*)}
  + \gamma \Omega_{max}(E_{th}, \delta)
= \Big(1 - \gamma + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}^{(i)}(\lambda) \Gamma_{t-1}^{(i)} (1 - \eta\mu)^E\Big)[F(w_{t-1}) - F(w^*)] + \frac{E\eta\rho_1\gamma}{2}
  + \gamma (1 - \eta\mu)^E \sum_{i=0}^{t-2} \alpha_{t-1}^{(i)}(\lambda) \sum_{k \in B_{t-1}^{(i)}(E_{th})} \beta_i^{(k)}(\delta) \big[F(w_i) - F(w_{t-1})\big] + \gamma \Omega_{max}(E_{th}, \delta)
\overset{(f)}{\le} \big(1 - \gamma + \gamma (1 - \eta\mu)^E\big)[F(w_{t-1}) - F(w^*)] + \frac{E\eta\rho_1\gamma}{2} + \gamma G_{t-1}(\lambda) + \gamma \Omega_{max}(E_{th}, \delta)   (17)

where G_t(\lambda) := \sum_{i=0}^{t-1} \alpha_t^{(i)}(\lambda) e_t^{(i)} and we define G_0(\lambda) = 0. Steps (a), (b), (c) come from convexity, (d) follows from Lemma 1, and (e) comes from \Omega_{max} = \max_{0 \le i \le t, 0 \le t \le T} \Omega_t^{(i)}. Step (f) comes from the fact that \eta\mu < 1, 0 \le \alpha_t^{(i)}(\lambda) \le 1 and 0 \le \Gamma_t^{(i)} \le 1 for all i, t, and \sum_{i=0}^{t} \alpha_t^{(i)}(\lambda) = 1 for all t.

Applying the above result to the T global aggregations at the server, we have

\mathbb{E}[F(w_T) - F(w^*) \mid w_0]
\overset{(a)}{=} \mathbb{E}\big[\, \mathbb{E}[F(w_T) - F(w^*) \mid w_{T-1}] \,\big|\, w_0 \,\big]
\overset{(b)}{\le} \mathbb{E}\big[(1 - \gamma + \gamma(1 - \eta\mu)^E)[F(w_{T-1}) - F(w^*)] \,\big|\, w_0\big] + \frac{\gamma(E\eta\rho_1 + 2G_{T-1}(\lambda) + 2\Omega_{max}(E_{th}, \delta))}{2}
\overset{(c)}{\le} (1 - \gamma + \gamma(1 - \eta\mu)^E)^T [F(w_0) - F(w^*)] + \frac{\gamma(E\eta\rho_1 + 2G_{T-1}(\lambda) + 2\Omega_{max}(E_{th}, \delta))}{2}
  + \sum_{\tau=1}^{T-1} (1 - \gamma + \gamma(1 - \eta\mu)^E)^{\tau} \frac{\gamma(E\eta\rho_1 + 2G_{T-1-\tau}(\lambda) + 2\Omega_{max}(E_{th}, \delta))}{2}
\overset{(d)}{\le} (1 - \gamma + \gamma(1 - \eta\mu)^E)^T [F(w_0) - F(w^*)] + \big(1 - \{1 - \gamma + \gamma(1 - \eta\mu)^E\}^T\big) \frac{E\eta\rho_1 + 2G_{max}(\lambda) + 2\Omega_{max}(E_{th}, \delta)}{2(1 - (1 - \eta\mu)^E)}
\overset{(e)}{\le} (1 - \gamma + \gamma(1 - \eta\mu)^E)^T [F(w_0) - F(w^*)] + \big(1 - \{1 - \gamma + \gamma(1 - \eta\mu)^E\}^T\big) \frac{\rho_1 + 2\mu G_{max}(\lambda) + 2\mu \Omega_{max}(E_{th}, \delta)}{2\eta\mu^2}
= \nu^T [F(w_0) - F(w^*)] + (1 - \nu^T) Z(\lambda, E_{th}, \delta),

which completes the proof. Here, (a) comes from the law of total expectation, and (b), (c) are due to inequality (17). (d) is obtained from the definition of G_{max}(\lambda) := \max_{1 \le t \le T} \sum_{i=0}^{t-1} \alpha_t^{(i)}(\lambda) e_t^{(i)}. In addition, (e) follows from \eta\mu \le 1.

B. Additional experimental results


B.1. Performance comparison with Multi-Krum
While we compared Sageflow with RFA in our main manuscript, here we compare our scheme with Multi-Krum (Blanchard et al., 2017), a Byzantine-resilient aggregation method targeting the conventional distributed learning setup with IID data across nodes. In Multi-Krum, among N workers in the system, the server tolerates f Byzantine workers under the assumption 2f + 2 < N. After filtering out f worker nodes based on squared distances, the server chooses the M workers with the best scores among the N − f remaining workers and aggregates them. We set M = N − f when comparing our scheme with Multi-Krum.
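For reference, a compact sketch of the Multi-Krum rule described above is given below. The scoring and selection details follow the common description of the method and may differ from the exact implementation used in these experiments.

```python
import numpy as np

def multi_krum(updates, f):
    """Sketch of Multi-Krum (Blanchard et al., 2017): score each update by the sum of
    squared distances to its n - f - 2 nearest neighbours, then average the
    M = n - f updates with the lowest scores."""
    x = np.stack(updates)
    n = len(x)
    # Pairwise squared distances between flattened updates.
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    scores = []
    for i in range(n):
        others = np.delete(d2[i], i)
        scores.append(np.sum(np.sort(others)[: n - f - 2]))
    keep = np.argsort(scores)[: n - f]
    return x[keep].mean(axis=0)
```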

[Figure 4: test accuracy versus global round (panels (a), (b)) and versus running time (panels (c), (d)) under model-update poisoning: (a) FMNIST, only adversaries; (b) CIFAR-10, only adversaries; (c) FMNIST, both stragglers/adversaries; (d) CIFAR-10, both stragglers/adversaries.]

Figure 4. Performance comparison with Multi-Krum under model-update poisoning. With only adversaries, Multi-Krum performs well when an appropriate f parameter value is chosen. However, the performance of Multi-Krum degrades significantly when stragglers exist (even when combined with straggler-mitigating schemes). This is because the attack ratio can become very high when combined with staleness-aware grouping or the ignore stragglers scheme; the number of adversaries exceeds f, significantly degrading the performance of Multi-Krum. When Multi-Krum is combined with the wait for stragglers scheme, the performance is not degraded by adversaries but by waiting for slow devices.

Fig. 4 compares Sageflow with Multi-Krum under model-update poisoning with scale 10. The stragglers are modeled with

delay 0, 1, 2. We first observe Figs. 4(a) and 4(b), which show the results with only adversaries. It can be seen that if the
number of adversaries exceeds f, the performance of Multi-Krum drops dramatically. Compared to Multi-Krum, the proposed
Sageflow method can filter out the poisoned devices and then take the weighted sum of the survived results even when the
portion of adversaries is high. Figs. 4(c) and 4(d) show the results under the existence of both stragglers and adversaries,
under the model-update poisoning attack. We let C = 0.2 and r = 0.2, and the parameter f of Multi-Krum is set to the
maximum value satisfying 2f + 2 < N , where N depends on the number of received results for both staleness-aware
grouping (Sag) and ignore stragglers approaches. However, even when we set f to the maximum value, the number of
adversaries can still exceed f , which degrades the performance of Multi-Krum combined with staleness-aware grouping
(Sag) or the ignore stragglers approach. Obviously, Multi-Krum can be combined with the wait for stragglers strategy by
setting f large enough. However, this scheme still suffers from the effect of stragglers, which significantly slows down the
overall training process.

B.2. Experimental results with varying hyperparameters


To observe the impact of hyperparameter setting, we performed additional experiments with various δ and Eth values, the
key hyperparameters of Sageflow. The results are shown in Fig. 5 with only adversaries. We performed the data poisoning
attack for varying δ and the model-update poisoning attack with scale 0.1 for varying Eth . We set C = 0.2 and r = 0.2.
First, the results under data poisoning show that the performance of Sageflow is not sensitive to δ if it is chosen in the
appropriate range of [1, 2]. For the model-update poisoning attack, if we use a very small Eth like 0.3, the performance is
poor because a large number of devices get filtered out. If we use a large Eth , the performance is also very poor since the
scheme cannot filter out the adversaries. However, similar to the behavior of hyperparameter δ, we can confirm that our
scheme performs well regardless of dataset if Eth is chosen in an appropriate range of [1, 2].
To summarize, our scheme still performs well (better than RFA), even with coarsely chosen hyperparameter values regardless
of the dataset.

[Figure 5: test accuracy over global rounds for various δ and E_th values, in four panels: (a) FMNIST, data poisoning; (b) FMNIST, model update poisoning; (c) CIFAR-10, data poisoning; (d) CIFAR-10, model update poisoning.]

Figure 5. Impact of varying hyperparameter values under model-update poisoning and data poisoning attacks. The performance of Sageflow is not highly sensitive to the exact settings of the loss exponent δ and entropy threshold E_th, as long as they are chosen in a reasonable range.
