FLUTE: A Scalable, Extensible Framework for High-Performance Federated Learning Simulations

Dimitrios Dimitriadis¹, Mirian Hipolito Garcia*¹, Daniel Madrigal Diaz*¹, Andre Manoel*¹, Robert Sim¹

arXiv:2203.13789v1 [cs.LG] 25 Mar 2022

Abstract

In this paper we introduce "Federated Learning Utilities and Tools for Experimentation" (FLUTE), a high-performance open-source platform for federated learning research and offline simulations. The goal of FLUTE is to enable rapid prototyping and simulation of new federated learning algorithms at scale, including novel optimization, privacy, and communications strategies. We describe the architecture of FLUTE, which enables arbitrary federated modeling schemes to be realized, compare the platform with other state-of-the-art platforms, and describe the available features of FLUTE for experimentation in core areas of active research, such as optimization, privacy, and scalability. We demonstrate the effectiveness of the platform with a series of experiments for text prediction and speech recognition, including the addition of differential privacy, quantization, scaling, and a variety of optimization and federation approaches.

1 Introduction

Distributed Training (DT) has drawn much scientific attention with focus on scaling the model training processes, either via model or data parallelism. As training datasets grow larger, the need for data parallelism has become a priority. Different approaches have been proposed over the years (Ben-Nun & Hoefler, 2019), aiming at more efficient training, either in the form of training platforms such as "Horovod" (Sergeev & Bals, 2018; Abadi et al., 2016a) or algorithmic improvements like "Blockwise Model-Update Filtering" (BMUF) (Chen & Huo, 2016). These techniques are evaluated on metrics such as data throughput (without compromising accuracy), model and/or training dataset size, and GPU utilization. However, there are a few underlying assumptions implied for such DT scenarios, i.e., data and device uniformity and efficient network communication between the working nodes. Besides the communication/network specifications, data uniformity is paramount for successful training, ensured by repeated randomization and data shuffling steps.

New constraints in data management are emerging, driven partly by the need for privacy compliance of personal data and information (Wolford, 2021). As such, increasingly more data is stored behind inaccessible firewalls or on users' devices without the option of being shared for centralized training. The "Federated Learning" (FL) paradigm has been proposed as a strategy to address these constraints. Federated learning is a decentralized machine learning scheme with focus on collaborative training and user data privacy. The key idea is to enable training of a global model with the collaboration of multiple participants (clients) coordinated by a central server. Each client trains the model using local data samples and then sends the tuned parameters back to the server, where the aggregated information is used to update the global model.

One of the challenges of using Federated Learning platforms is the need for scaling the learning process to millions of clients, in order to simulate real-world conditions. As such, testing and validating any novel algorithm in realistic scenarios, e.g., using real devices or close-to-real scaled deployments, is particularly difficult. A simulation platform can play an important role in enabling researchers and developers to develop proof-of-concept implementations (POCs) and validate their performance before building and deploying in the wild. While several open-source frameworks have been developed to enable FL solutions, few offer end-to-end simulation, experiment orchestration and scalability.

This paper introduces "Federated Learning Utilities and Tools for Experimentation" (FLUTE) as a framework for running large-scale, offline FL simulations. It is designed to be flexible, to enable arbitrary model architectures, and to allow for prototyping novel approaches in federation, optimization, quantization, privacy, and so on. It also provides an optional integration with AzureML workspaces, enabling scenarios closer to real-world applications, and leveraging platform features to manage and track experiments, parameter sweeps, and model snapshots.

*Equal contribution. ¹Microsoft Research, Redmond, US. Correspondence to: Dimitris Dimitriadis <didimit@microsoft.com>, Robert Sim <rsim@microsoft.com>.

The main contributions of this work are:

• Create a platform for high-performance FL simulation at scale (scaling to millions of clients),

• Provide flexibility to the platform for research, experimentation, and POC development,

• A generic API for new model types and data formats,

• A range of experimentation features: state-of-the-art federation algorithms, optimizers, differential privacy, quantization, dropout and stale clients,

• Experimental results illustrating the utility of the platform for FL research,

• A competitive analysis and comparisons with some of the leading FL simulation platforms.

On the other hand, FLUTE does not address challenges like data collection, secure aggregation, device labelling or attestation. Rather, the goal of FLUTE is to facilitate the study of new algorithmic paradigms and optimizations, enabling more effective FL solutions in real-world deployments.

The code for the platform is open-source and available at https://github.com/microsoft/msrflute.

2 Background and Prior Work

In general, there are two different approaches concerning the architecture of FL systems: either using a central server (Patarasuk & Yuan, 2009) as the "coordinator or orchestrator", or opting for peer-to-peer learning without the need of a central server (Liang et al., 2020). FLUTE is based on the "server-client" architecture. Besides defining the architecture, there are some technical challenges that need to be addressed for a successful implementation, mostly in the areas of communication efficiency, optimization and learning processes (Shamir et al., 2013), and privacy constraints.

Most of these challenges arise because of the distributed nature of FL and the data segregation:

Communication overhead: FL relies heavily on the communication between the server and the clients to complete any training iteration. The fact that some of the clients and the server can be in different networks may cause limited connectivity, high latency and other issues. Different approaches have been proposed, e.g., gradient quantization and sparsification (McMahan et al., 2017; Jhunjhunwala et al., 2021). Some of these approaches are supported in FLUTE, while their impact on overall performance is minimal, as discussed in Section 4.5.

Unbalanced and/or non-IID data: Local training data are individually generated according to client usage, e.g., users spending more time on their devices tend to generate more training data than others. Therefore, it is expected that these locally segregated training sets may not be either a representative sample of the data distribution or uniformly distributed between clients. A simple strategy to overcome communication overheads and non-IID data distributions was proposed with the "Federated Averaging" (FedAvg) algorithm (McMahan et al., 2017). In this approach, the clients perform several training iterations and then send the updated models back to the server for aggregation based on a weighted average. FedAvg (McMahan et al., 2017), Section 4.1, is one of the baseline training strategies for FL, given its simplicity and the consistently good results achieved in multiple experiments. On the other hand, FedAvg is neither the only nor the best aggregation strategy. Over time, new approaches have emerged to overcome the limitations of FedAvg, for example the DGA algorithm (Dimitriadis et al., 2020a), which proposes an optimization strategy to address the heterogeneity problems on data and devices, as detailed in Section 4.2.

Hardware heterogeneity: Computing capabilities of each client can vary, i.e., CPU, memory, battery level and storage are not expected to be the same across all nodes. This can affect the selection and availability of the participating devices and it can bias the learning process. Different approaches have been proposed to address clients that fall behind, i.e., "stragglers", the most popular of them allowing for asynchronous updates and client dropouts, as in Section 4.6.

Threats: The interest in applying FL in different scenarios has increased in recent years, as it enables training of machine learning models over scattered data generated by thousands of users while maintaining their privacy. However, FL itself cannot completely assure either data privacy or robustness to diverse attacks proven to be effective in breaking privacy. Without any mitigation, both the server and the clients can be attacked by malicious users. For example, attackers can try to poison the model by sending fake model parameters back to the server (Zhang et al., 2021), or fake the server and send a malicious model to the clients, stealing the local information (Enthoven & Al-Ars, 2020). To tackle this issue, FL strategies started to incorporate techniques like Differential Privacy (DP), Section 4.4, which allows obfuscating the data while maximizing its utility (Wei et al., 2020), or Multi-Party Computation (MPC), which only reveals the computation result while maintaining the confidentiality of all the intermediate computations (Byrd & Polychroniadou, 2020; Bhowmick et al., 2019).

Simulation and prototyping: Building federated learning solutions can require significant up-front engineering investment, often with an unclear or uncertain outcome. Simulation frameworks enable FL researchers and engineers to estimate the potential utility of a particular solution and investigate novel approaches before making any significant investments. Recently, several frameworks have been proposed for FL simulations, including TensorFlow Federated (Abadi et al., 2016a) and PySyft (Ziller et al., 2021), each with a different focus and a different simulation scope for their proof-of-concept scenarios. We briefly compare the main features of these frameworks alongside FLUTE in Section 6.

3 FLUTE Architecture

In the sections below, we distinguish the logical workflow from the physical implementation of FLUTE, so that accurate logical simulation can be implemented as efficiently as possible in the physical platform.

3.1 Logical Workflow

FLUTE has been designed as a scalable framework for rapid prototyping of FL scenarios. As such, it provides researchers with the flexibility to run large-scale experiments, encouraging them to propose novel FL solutions that address real-world applications. The FLUTE design is based on a central-server architecture, as depicted in Figure 1.

Figure 1. Logical architecture of the FLUTE platform.

The logical workflow performed by FLUTE is:

1. Send an initial global model to clients,
2. Train instances of the global model with locally available data on each client,
3. Send training information, e.g., adapted models, logits, and/or gradients/pseudo-gradients, back to the server,
4. Combine the returned information on the server to produce a new global model,
5. Optionally, update the global model with an additional server-side rehearsal step,
6. Send the updated global model to the clients,
7. Repeat Steps 2-6 after sampling a new subset of clients for the next training iteration.
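The seven steps above amount to a simple training loop. The following minimal sketch, in plain PyTorch rather than the FLUTE API, illustrates one possible realization; the client sampler `sample_clients` and the local training routine `local_train` are hypothetical placeholders, and the pseudo-gradient is taken here as the difference between the global and the locally updated weights so that a standard descent step moves toward the clients' updates.

```python
import copy
import torch

def run_federated_rounds(global_model, clients, sample_clients, local_train,
                         num_rounds, clients_per_round, server_lr=1.0):
    """Illustrative federated loop following the logical workflow (steps 1-7)."""
    for _ in range(num_rounds):
        # Steps 1-3: send the global model to sampled clients, train locally,
        # and collect pseudo-gradients plus the number of local samples.
        sampled = sample_clients(clients, clients_per_round)
        pseudo_grads, weights = [], []
        for client in sampled:
            local_model = copy.deepcopy(global_model)
            num_samples = local_train(local_model, client)  # local SGD on client data
            pseudo_grads.append([gp.data - lp.data           # pseudo-gradient: w_global - w_local
                                 for gp, lp in zip(global_model.parameters(),
                                                   local_model.parameters())])
            weights.append(float(num_samples))
        # Step 4: combine the returned information into a new global model.
        total = sum(weights)
        with torch.no_grad():
            for i, gp in enumerate(global_model.parameters()):
                agg = sum(w * pg[i] for w, pg in zip(weights, pseudo_grads)) / total
                gp.add_(agg, alpha=-server_lr)
        # Step 5 (optional server-side rehearsal) and Step 6 (broadcast) are omitted;
        # Step 7: the loop repeats with a freshly sampled subset of clients.
    return global_model
```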

3.2 FLUTE Implementation

The distributed nature of FLUTE is based on Python/PyTorch, using OpenMPI (Gabriel et al., 2004) as the backbone for communication. A FLUTE job consists of one or more independent nodes, e.g., multi-GPU VMs, running up to K workers executing tasks assigned by the server. The tasks typically consist of using clients' data to either update a model or compute metrics.

During training, the server has the roles of both orchestrator and aggregator. First, it distributes client IDs among the workers. The workers, in their turn, process the clients' data to produce new models and send the models back to the server. After a number of clients is processed, the server aggregates all the resulting models, typically into a single global model. Algorithm 1 describes this process in detail and Figure 2 illustrates the execution flow.

Figure 2. FLUTE Execution Flow: The server samples M of the clients and sends them to the K workers. Every time one of the workers finishes processing the client data, it returns the gradient and draws the next client until all clients are processed. Figure based on (Bonawitz et al., 2019), which describes a similar process.

Algorithm 1 FLUTE Orchestration: P is a Client Pool, which contains the data of each client, N is the number of federation rounds to be executed, and K is the number of Workers
1: Server-Side (Worker-0):
2: for each federated round from 1, 2, ... to N do
3:   C ← Sample M clients from P
4:   for client in C do
5:     Dispatch model and client to available worker_k
6:   end for
7:   repeat
8:     Wait for worker_k to finish
9:     Save pseudo-gradient response from worker_k
10:    c ← Sample client from P
11:    Dispatch model and client data c to worker_k
12:  until all clients in P have been processed
13:  Aggregate pseudo-gradients
14:  Update model with optimization step
15: end for
16:
17: Client-Side (Worker-k, k > 0):
18: Load client and model data
19: Execute Training Procedure
20: Send pseudo-gradient to Worker-0
21: Send statistics about local training to Worker-0

Figure 3 illustrates the communication between "Worker 0/Server" and the remaining workers. The interface between Server and Workers is based on "messages", client IDs and training information. In summary, there are four messages that can be passed from server to worker:

• Update: Creates a copy of the model on the worker. The model is passed from server to worker using MPI.

• Train: Triggers the execution of a training step on a worker, for a given client. The resulting model (or pseudo-gradient) is passed from worker to server.

• Evaluate: Triggers the execution of an evaluation step on a given client. The resulting metrics are passed from worker to server.

• Terminate: Shuts down the MPI thread where the worker has been instantiated.

Figure 3. Client-server communication protocol.

In contrast to distributed training, any end node, such as smartphones or private Cloud computes, can be a client in an FL application. As such, local data on each client stays within the local storage boundaries and is never either sent to the server or aggregated with other local data sources. Only (sometimes encrypted¹) models or gradients are communicated between servers and clients. In this research simulator, all clients are implemented as isolated object instances and never communicate training data (as part of the FLUTE design, the server can hold all clients' data during these simulations, to optimize communication overheads – however, this does not impact the FL assumptions).

¹Encryption and secure aggregation are not currently implemented in FLUTE – these are security mechanisms which aren't strictly necessary for simulation.

4 Features of FLUTE

The main goal of FLUTE is to be a scalable framework for rapid prototyping, encouraging researchers to propose novel FL solutions that address real-world applications, covering the following challenges:

• Scalability: Capacity to process many thousands of clients on any given round. FLUTE allows running large-scale experiments using up to 10,000 clients with reasonable turn-around time, since scale is a critical factor in understanding practical metrics such as convergence and privacy-utility trade-offs.

• Flexibility: Allow for any combination of model, dataset and optimizer. FLUTE supports diverse FL configurations, including standardized implementations such as DGA and FedAvg, with PyTorch being the framework of choice for implementing the models.

• Expandable: Allow the end users to easily plug in customized/new techniques like differential privacy or gradient quantization. FLUTE provides an open architecture allowing users to incorporate new algorithms in a straightforward fashion.

4.1 Federated Learning – FedAvg

The "Federated Averaging" (FedAvg) algorithm (Konecny et al., 2015; McMahan et al., 2017) is the first and perhaps the most widely used FL training algorithm. The server samples $M_T \subset N$ of the available $N$ devices and sends the model $w_T^{(s)}$ at the current iteration $T$. Each client $j$, $j \in M_T$, has a version of the model $w_T^{(j)}$, which is locally updated with the segregated local data. The size of the available data $D_T^{(j)}$, per iteration $T$ and client $j$, is expected to differ and, as such, $N_T^{(j)} = |D_T^{(j)}|$ is the number of processed local training samples. After running $E$ steps of SGD, the updated model $\hat{w}_T^{(j)}$ is sent back to the server. The new model $w_{T+1}$ is given by

$w_{T+1} \leftarrow \frac{1}{\sum_{j \in M_T} N_T^{(j)}} \sum_{j \in M_T} N_T^{(j)}\, \hat{w}_T^{(j)}$    (1)
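As a concrete illustration of the aggregation rule in Eq. (1), the sketch below computes the new global weights as the data-size-weighted average of the client models returned in a round. This is a hypothetical helper, not FLUTE's own implementation.

```python
import torch

def fedavg_aggregate(client_states, client_sizes):
    """Weighted average of client state_dicts, as in Eq. (1).

    client_states: list of state_dicts (the updated client models of the round)
    client_sizes:  list with the number of local training samples per client
    """
    total = float(sum(client_sizes))
    new_state = {}
    for name in client_states[0]:
        new_state[name] = sum(
            (n / total) * state[name].float()
            for state, n in zip(client_states, client_sizes)
        )
    return new_state

# Usage: global_model.load_state_dict(fedavg_aggregate(states, sizes))
```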

The server-side model in iteration T + 1 is a weighted average of the locally updated models of the previous iteration. Despite some drawbacks, like the lack of fine-tuning or annealing of a global learning rate, FedAvg is the baseline aggregation approach in FLUTE, based on its popularity.

4.2 Adaptive Optimization

The FedAvg algorithm, although the gold standard in FL training, presents several drawbacks (Zhao et al., 2018). A different family of learning algorithms, "Adaptive Federated Optimizers", has been proposed to address them (Reddi et al., 2020; Dimitriadis et al., 2020a; Li et al., 2019), where the clients return pseudo-gradients instead of models.

The training process consists of two optimization steps: first, on the client side, using a stateless optimizer for local SGD steps, and then on the server side, with a "global" optimizer utilizing the aggregated gradient estimates. The two-level optimization provides both speed-ups in convergence rates, due to the second optimizer on the server, and improved control over the training by adjusting the learning rates. In addition, scaling in the number of clients becomes straightforward by adjusting the server-side optimizer, as in the proposed DGA algorithm (Dimitriadis et al., 2021).

The FLUTE system provides support for this group of federated optimizers by adjusting the gradient aggregation weights and the server-side optimizers, as in Appendix A and (Dimitriadis et al., 2020a;b; 2021), making FedYogi, FedAdam (Reddi et al., 2020), and DGA, Section 4.3, rather straightforward to apply. Also, the FLUTE client-scaling capabilities are enhanced by switching to large-batch optimizers on the server side, like LAMB (You et al., 2020) and LARS (You et al., 2017), validated by the experimental evidence shown in Section 5.

4.3 Dynamic Gradient Aggregation

Heterogeneous local data poses additional challenges, especially for the aggregation step, as in Eq. 5, Appendix A. An adverse effect of these heterogeneous distributions is that the gradients can point in different directions and the aggregation process becomes noisier due to this diversity. All gradients – based on local training – where losses are of similar magnitude are assumed to move the model similarly²; thus, the aggregated gradients are expected to be somewhat aligned. Such alignment of the aggregated gradients is shown to be beneficial both for convergence speed and performance, as shown in Figure 4. Gradients that deviate from the rest should be processed differently: the proposed approach uses weights during the aggregation step, i.e., the contribution of some components is de-emphasized by weighting the local gradients $\tilde{g}_T^{(j)}$ in Eq. 3, Appendix A. The proposed algorithm is called "Dynamic Gradient Aggregation" (DGA) (Dimitriadis et al., 2020a). The weighting process is a type of regularization, decreasing the variance of the aggregated gradients.

Two flavors of DGA are herein used: first, DGA_SM, using the training losses as weights, and second, the "data-driven" approach, DGA_RL, where a model is trained to infer the weights; more details in Section 3 of (Dimitriadis et al., 2021). In some tasks where local data distributions are similar, DGA appears not to significantly affect the overall performance, such as the LibriSpeech task, Section 5.1 and Fig. 4. Even then, the models converge significantly faster. On the other hand, DGA significantly improves both convergence speed and model performance in tasks where the label quality widely varies or diverse local data distributions exist, e.g., unsupervised training, as in (Dimitriadis et al., 2021).

FLUTE supports any of the adaptive federated optimizers, but DGA has been proven the most robust and fastest (in convergence) across a wide range of applications (Dimitriadis et al., 2020a). As such, it is the default optimizer for the platform.

²DGA reduces the variance of the aggregated gradients according to the proof found in (Dimitriadis et al., 2021).

4.4 Differential Privacy

In the federated learning context, differential privacy (DP) (Dwork et al., 2014) is typically enforced by clipping the norm of, and then adding noise to, the gradients produced during the training procedure (Abadi et al., 2016b). This can be done either by each client (local DP) or by the server (global DP), depending on the scenario. While there is some overhead associated with performing DP at the sample level, one could choose to clip/add noise to each user's average gradient instead. That still limits the impact of any given client on the learned model.

In FLUTE, either local or global DP can be used, depending on whether the clients or the server are responsible for doing the clipping and noise-adding. In both cases, that is done directly to the pseudo-gradient, i.e., the difference between current and previous weights after each user's data is processed. The pseudo-gradients are re-scaled so that their norm is at most C, ensuring that the norm of the difference between any two of them (the sensitivity) is bounded. We typically use Gaussian noise, with variance $\sigma^2 = 2\log\left(\frac{1.25}{\delta}\right)\frac{C^2}{\epsilon^2}$, picked so that the aggregation is at most $(\epsilon, \delta)$-DP w.r.t. each client. In the case of DGA (Dimitriadis et al., 2020a), the aggregation weights also go through the same procedure, since they are data-dependent.

Local clipping and noise addition can be accumulated as if it were per-example (i.e., per-client, or per-pseudo-gradient) clipping and noise addition in global DP-SGD. As such, we track the per-client noise across all clients and apply the Rényi DP accountant globally (Mironov, 2017). Note that tighter bounds have been presented recently in the literature (Gopi et al., 2021) and could improve these results.

4.5 Quantization

Gradients produced during training of neural networks are known to be quite redundant and can be compressed without adversely affecting the optimization procedure, e.g., gradient components can be represented by a single bit (Seide et al., 2014), or some of the components can be discarded, effectively making the gradient sparse (Wangni et al., 2018). In federated learning, compressing gradients leads to decreased bandwidth, which might be important depending on the model size or the latency of the network.

In FLUTE, we use an approach similar to that of (Alistarh et al., 2017). At each layer of the neural network, we first get the dynamic range of the gradient components, and then create a histogram of 2^B bins between these two values. Next, we replace each gradient component by the label of the closest bin. That way, we only need to communicate the bin indices, together with the min. and max. values. This quantization procedure is done on the client side, meaning it applies only to client-to-server communication.

Finally, we also provide the option to sparsify gradients by keeping only the p% largest (in absolute value) components. If quantization is also active, binning is done before sparsification, but the original value of the component is used to decide whether it is replaced by the bin label or zeroed out.

4.6 Dropout and Stale Clients

The FLUTE platform offers flexibility when sampling the participating clients from the pool of candidates. A range in the number of clients can be given, fluctuating between the two ends. This can be seen as "client dropouts", where a number of clients can be randomly discarded. Since the default optimization pipeline is based on DGA, as in Section 4.3, the learning rate can be adjusted accordingly.

Similar to the dropout functionality, FLUTE offers an option of delaying the contributions of random clients rather than discarding the corresponding gradients. The system can introduce a 1-step "staleness" by randomly delaying a subset of the clients by one iteration. The convergence analysis for the stale-gradients scenario is presented in Appendix B. As shown, the error introduced due to staleness is upper bounded. As such, there is a theoretical guarantee that the model will finally converge. The theoretical conclusions are experimentally verified using FLUTE, as shown in Figure 5.

4.7 AzureML Integration

AzureML (AML) (AzureML Team, 2016) is the Azure Cloud service designed for staging, executing, tracking, and summarizing Machine Learning experiments. AML provides Kubernetes services for running containerized workflows on targeted computing nodes, including multi-GPU nodes. FLUTE has a native AML integration included for job submissions, allowing the users to use the built-in CLI or web interface for job/experiment tracking and for visualization purposes. In this case, job submission is handled by a configuration file containing all the job-related parameters, e.g., target, cluster, code, while storing the setup related to the experiment on an Azure storage account. The models and logs can be downloaded to a local machine. AML also enables tracking all the jobs, allowing users to rerun crashed jobs, analyze metrics and abort any job in progress.

Besides AML, FLUTE also runs seamlessly on stand-alone devices such as laptop and desktop machines, using local GPUs when available.

5 Case Studies

This section provides insights by exploring a variety of features of the FLUTE platform. The list of presented datasets, tasks and experimental results is by no means exhaustive. In this context, we do not present any of the models used, since the platform allows training with any architecture currently supported by PyTorch.

5.1 Baseline Experiments and Datasets

ASR Task: LibriSpeech FLUTE offers a Speech Recognition template task based on the LibriSpeech task (Panayotov et al., 2015). The dataset contains about 1,000 hours of speech from 2,500 speakers reading books. Each of the speakers is labeled as a different client. In one of the ASR task examples, a sequence-to-sequence model was used for training; more details can be found in (Dimitriadis et al., 2020b).

Computer Vision Task: MNIST and EMNIST Two different datasets, i.e., the MNIST (LeCun & Cortes, 2010) and the EMNIST (Cohen et al., 2017) datasets, are used for Computer Vision tasks. The EMNIST dataset is a set of handwritten characters and digits captured and converted to 28 × 28 pixel images, maintaining the image format and data structure and directly matching the MNIST dataset. Among the many splits of the EMNIST dataset, we use "EMNIST Balanced", containing ~132k images with 47 balanced classes.
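Datasets like these are turned into federated benchmarks by keying samples on a natural client identifier, e.g., one LibriSpeech speaker per client as described above. A purely illustrative sketch of such a grouping (the field name is hypothetical, and this is not FLUTE's data format):

```python
from collections import defaultdict

def group_by_client(samples, client_key="speaker_id"):
    """Group a flat list of sample dicts into per-client shards,
    e.g. one LibriSpeech speaker per client. Purely illustrative."""
    shards = defaultdict(list)
    for sample in samples:
        shards[sample[client_key]].append(sample)
    return dict(shards)

# e.g. clients = group_by_client(librispeech_samples, client_key="speaker_id")
# yielding one shard per speaker (~2,500 clients for LibriSpeech)
```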

NLP Tasks: Reddit Different NLP tasks are supported in FLUTE, with two use-cases for MLM and next-word prediction using Reddit data (Baumgartner et al., 2020). The Reddit dataset consists of users' posts grouped by the month in which they were published on the social network. For the use-cases, we use 2 months of Reddit data with 2.2M users. The initial/seed models used are based either on HuggingFace or a baseline LM model, as described below.

Sentiment Analysis: sent140, IMDb, YELP Sent140 (Go et al., 2009) is a sentiment analysis dataset consisting of tweets, automatically annotated from the emojis found in them. The dataset consists of 255k users, with a mean of 3.5 samples per user. IMDb is based on movie reviews of 1,012 users providing 140k reviews with 10 rating classes (Diao et al., 2014). The YELP dataset is based on restaurant reviews from Yelp and the sentiment label ranges from 1 to 5 (Tang et al., 2015). It contains 2.5k users with 425k reviews.

Baseline LM Model: A baseline LM model is used for most of the experiments in Section 5. A two-layer GRU with 512 hidden units, a 10,000-word vocabulary, and embedding dimension 160 is used for fine-tuning during the FL experiments. The seed model is pretrained on the Google News corpus (Gu et al., 2020).

5.2 Convergence Experiments of Federated Optimizers

We investigate different scenarios for convergence based on three federated optimizers, i.e., FedAdam, DGA_SM and DGA_RL, on the Speech Recognition task, as in Section 5.1. It is shown in the literature that FedAdam is much faster than FedAvg in convergence, and as such this experiment is omitted. The results, as in Figure 4, show a significant speed-up due to DGA_SM of about 3× compared with FedAdam. The second approach, DGA_RL, is about 40% faster than DGA_SM, consistent across various tasks.

Figure 4. Convergence of various algorithms on the ASR LibriSpeech Task. Figure extracted from (Dimitriadis et al., 2020a) and produced using an early version of the FLUTE platform.

5.3 DP Experiments

In this experiment, we explore privacy-utility trade-offs in LM training, using the baseline model described in Section 5.1. Local differential privacy was applied with ε parameter ε_LDP. We track the per-client noise across all clients and yield a final global ε_RDP after 500 rounds of training. We express the privacy-utility trade-off as the ratio of accuracy and ε_RDP. Clearly, for a fixed number of training rounds, we achieve better privacy and better accuracy by sampling a larger number of clients. We observe a penalty of about 4.3% relative to the non-private baseline (ε = ∞).

Table 1. Next-word prediction: Results comparing language model accuracy and privacy for various client sampling and LDP ε parameters. All experiments trained for 500 rounds. Trade-off is defined as the ratio of accuracy and ε_RDP.

Clients/Iter. | LDP ε | RDP ε | Acc @1 (%) | Trade-off
10,000 | 100 | 0.108 | 17.20 | 0.00172
1,000 | 100 | 0.333 | 14.80 | 0.00148
10,000 | 500 | 1.41 | 20.00 | 0.000399
1,000 | 500 | 12.6 | 18.00 | 0.000361
10,000 | 750 | 3.65 | 20.30 | 0.00027
1,000 | 750 | 38.3 | 18.80 | 0.000251
10,000 | 1,000 | 7.66 | 20.30 | 0.000203
1,000 | 1,000 | 79.7 | 19.40 | 0.000194
10,000 | inf | inf | 21.70 | 0.0
1,000 | inf | inf | 21.40 | 0.0

5.4 Quantization Experiment

In Table 2, the accuracy for a next-word-prediction task is shown, on the Reddit dataset and the baseline LM model described in Section 5.1, while using different values of the quantization bit-width B. As expected, using fewer bits while training for the same number of epochs leads to decreased performance in terms of accuracy.

Table 2. Next-word prediction task: Top-1 accuracy after gradient quantization. The number of bits per gradient coefficient varies from 2 to 32.

 | Quant. (bits) | Acc @1 (%) | Rel. Improv. (%)
Seed Model | N/A | 9.83 | (56.62)
SST | N/A | 22.30 | (1.59)
FL Training | 32 | 22.70 | 0
 | 10 | 22.40 | (1.32)
 | 8 | 22.20 | (2.25)
 | 4 | 21.30 | (5.87)
 | 3 | 18.80 | (17.21)
 | 2 | 17.80 | (21.58)

We have also done experiments varying the sparsity level, while keeping the quantization constant at 8 bits, cf. Table 3.
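For reference, the per-layer binning scheme of Section 4.5 can be sketched as follows. This is an illustrative re-implementation of the idea (uniform bins over the dynamic range, with optional top-p% sparsification decided on the original values), not the code used for these experiments.

```python
import torch

def quantize_layer(grad: torch.Tensor, num_bits: int = 8, keep_fraction: float = 1.0):
    """Uniform-bin quantization of one layer's gradient, with optional sparsification.

    Returns the de-quantized tensor a receiver would reconstruct from the bin
    indices plus the (min, max) values that would actually be transmitted.
    """
    g_min, g_max = grad.min(), grad.max()
    num_bins = 2 ** num_bits
    scale = (g_max - g_min) / max(num_bins - 1, 1)
    # Replace each component by the label of the closest bin.
    indices = torch.round((grad - g_min) / (scale + 1e-12)).clamp(0, num_bins - 1)
    dequant = g_min + indices * scale
    if keep_fraction < 1.0:
        # Keep only the largest components (in absolute value) of the original gradient.
        k = max(1, int(keep_fraction * grad.numel()))
        threshold = grad.abs().flatten().kthvalue(grad.numel() - k + 1).values
        dequant = torch.where(grad.abs() >= threshold, dequant, torch.zeros_like(dequant))
    return dequant
```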

In this particular experiment, we had gains in bandwidth of up to 16x with no significant change in performance. Error compensation techniques (Strom, 2015) could be attempted in order to increase the performance at higher sparsity levels. The difference in performance for the 8-bit quantization level between Tables 2 and 3 is due to noise during the training process.

Table 3. Performance obtained by varying the sparsity level on gradients while keeping quantization fixed at 8 bits – gains in bandwidth are relative to standard 32-bit gradients. The performance reported is the best one over 5,000 iterations, with 1,000 clients being processed at each iteration.

% Sparsity | Gain in Bandwidth | Acc @1 (%)
0.0 | 4x | 22.60
75.0 | 16x | 21.70
95.0 | 80x | 19.00
99.0 | 400x | 17.70

5.5 Stale Gradients

We experimentally verify the theoretical analysis in Appendix B with two different experiments, depending on the percentage of stale clients: 20% and 50% of the 1,000 clients are stale – staleness in this experiment equals 1 cycle. This experiment is based on the next-word prediction task using the Reddit dataset, together with the baseline LM model described in Section 5.1. As suggested, the model still converges to an optimal point in terms of accuracy. However, it takes longer for the case of 50% to reach a good point in performance.

Figure 5. Next-word Prediction task: Top-1 Accuracy for the Reddit dataset with staleness of 1 iteration.

5.6 Performance for Variable/Different Number of Clients

The number of clients processed at each round is a variable we can control in FLUTE. Here, we show how long a round typically takes using a varying number of clients, on a simulation with 1 server + 3 workers attached to RTX A6000 GPUs and 2.45GHz AMD EPYC cores. Notice that, since clients are processed sequentially by each worker, runtime scales linearly; FLUTE provides options to speed up this process, such as processing clients in multiple threads and pre-encoding the data.

Table 4. How long it takes for 3 workers to process different numbers of clients, on a simple NLG experiment using a GRU model and the Reddit dataset. Averages are computed over 20 iterations.

Number of Clients | Runtime (sec.)
1,000 | 22.1 ± 0.6
5,000 | 111.3 ± 2.4
10,000 | 219.0 ± 2.3
50,000 | 1103.7 ± 11.3

Table 4 shows that FLUTE scales gracefully with the number of clients per iteration, without any upper bound to that number. We can also look at the predictive performance attained for different numbers of clients, and study how it changes as a function of the optimizer used.

Table 5. Next-word Prediction task: Top-1 accuracy achieved varying the number of clients and optimizers.

 | Optimizer | Acc @1 (%)
1k clients/iter | Seed Model | 9.80
 | Adam (Baseline) | 22.70
 | RL-based DGA | 22.80
10k clients/iter | Adam (Baseline) | 20.80
 | SGD-LARS | 17.00
 | Adam-LARS | 21.40
 | SGD-LAMB | 23.00
Variable number | Adam | 22.30

In Table 5, we compare four different scenarios for optimizers, increasing the number of clients, showing that the accuracy remains stable for most of them. However, the Adam optimizer loses accuracy as the number of clients increases, compared to SGD-LAMB, which reaches a better performance with a larger number of clients.

5.7 Comparing Optimizers

This next-word prediction experiment, using the Reddit dataset and the baseline LM model described in Section 5.1, explores model training performance for a variety of state-of-the-art optimizer choices. We trained a recurrent language model, fixing the number of clients per round to 1,000, and varying the choice of optimizer in the central aggregator. Specifically, we applied standard SGD (Rosenblatt, 1958), ADAM (Kingma & Ba, 2017), LAMB (You et al., 2020), and LARS (You et al., 2017). Table 6 illustrates the performance of each optimizer, including maximum validation accuracy and convergence rate: the number of rounds to reach 95% of the max. accuracy. Note there is no hyper-parameter tuning of the optimizers for this experiment.
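The central-aggregator optimizers compared here (and in Section 4.2) act on the aggregated pseudo-gradient rather than on a local loss. One way to sketch this wiring, using a standard torch.optim optimizer as the server-side optimizer; this is an illustrative pattern, not FLUTE's internals.

```python
import torch

def server_step(global_model, aggregated_pseudo_grad, server_optimizer):
    """Apply one server-side optimization step using an aggregated pseudo-gradient.

    aggregated_pseudo_grad: list of tensors, one per parameter, already weighted
    and summed over the clients of the round (e.g., Eq. (1) or DGA weighting).
    """
    server_optimizer.zero_grad()
    for param, pseudo_grad in zip(global_model.parameters(), aggregated_pseudo_grad):
        # Feed the pseudo-gradient to the optimizer as if it were a gradient.
        param.grad = pseudo_grad.detach().clone()
    server_optimizer.step()

# Usage with any central optimizer, e.g.:
#   opt = torch.optim.Adam(global_model.parameters(), lr=1e-3)
#   server_step(global_model, agg_grads, opt)
```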

Table 6. Next-word prediction task: Top-1 Accuracy and training rounds to 95% convergence for various central optimizer choices.

Optimizer | Acc @1 (%) | Convergence Round
LAMB | 23.10 | 115
ADAM | 22.70 | 641
SGD | 20.60 | 2172
LARS | 17.40 | 414

6 Comparison with Related Platforms

FLUTE allows customized training procedures and complex algorithmic implementations, making it a valuable tool to rapidly validate the feasibility of novel FL solutions, while avoiding the need to deal with complications that production environments present.

To this day, different FL platforms have been proposed; however, most of them have been designed with a specific purpose, which limits the flexibility to experiment with complex FL scenarios. Current FL frameworks include a wide variety of tools, but only a few support customized training procedures for experimentation environments. Table 8 shows the most common FL platforms and their main focus. Table 7 shows a deeper comparison focusing only on research-dedicated platforms and their main features.

In recent years, researchers have made significant efforts to address the challenges FL raises, especially when it comes to setting up FL-friendly environments – privacy guarantees, time-consuming processes, communication costs and beyond. With FLUTE, some of these challenges are addressed, allowing enhanced customization and enabling new research at realistic scales.

7 Conclusions - Discussion

In this paper we have presented FLUTE³, a versatile, open-architecture platform for high-performance federated learning simulation that is available as open source. FLUTE provides scaling capabilities, several state-of-the-art federation approaches and related features such as differential privacy, and a flexible API enabling extensions and the introduction of novel approaches. FLUTE is model and task independent, and provides facilities for easy integration of new model architectures based on PyTorch.

In the development of FLUTE we have identified several key challenges in developing high-performance FL simulators. First, the communication and CPU overhead of managing many client tasks and model updates tends to impose a bottleneck that prevents optimal GPU usage. In addition, standard model and dataloader APIs require additional wrapper code, which is often task dependent, to facilitate easy integration into an FL framework. We will continue our efforts to address these challenges with new releases of the platform.

The goal of FLUTE is to enable rapid experimentation and prototyping, and facilitate the development of new FL research efforts. We encourage the research community to explore new research using FLUTE and invite contributions to the public source repository.

³Repository: https://github.com/microsoft/msrflute

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A System for Large-Scale Machine Learning. arXiv preprint arXiv:1605.08695, 2016a.

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318, 2016b.

Agarwal, A. and Duchi, J. C. Distributed delayed stochastic optimization. Advances in Neural Information Processing Systems, pp. 873–881, 2011.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30:1709–1720, 2017.

AzureML Team. AzureML: Anatomy of a Machine Learning service. In Dorard, L., Reid, M. D., and Martin, F. J. (eds.), Proc. of The 2nd Intern. Conf. on Predictive APIs and Apps, volume 50 of Proceedings of Machine Learning Research, pp. 1–13, Sydney, Australia, 06–07 Aug 2016. PMLR. URL https://proceedings.mlr.press/v50/azureml15.html.

Baumgartner, J., Zannettou, S., Keegan, B., Squire, M., and Blackburn, J. The Pushshift Reddit dataset. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pp. 830–839, 2020.

Ben-Nun, T. and Hoefler, T. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Computing Surveys, 4(65), 2019.

Bhowmick, A., Duchi, J., Freudiger, J., Kapoor, G., and Rogers, R. Protection against reconstruction and its applications in private federated learning, 2019.

Table 7. Comparison between FLUTE and Popular Federated Learning Simulation Platforms

 | FLUTE | TensorFlow Federated | PySyft
Framework | PyTorch | TensorFlow | PyTorch
Communication | MPI | NCCL | Homebrew Protocol
Cloud Integration | ✓ | ✓ | ✓
Server Side Definitions | ✓ | ✓ | ✓
Dropout | ✓ | ✓ | ✓
Support Multiple Federated Optimization Schemes | ✓ | ✓ | ✗
Scalability | ✓ | ✓ | ✗
Stale clients | ✓ | ✗ | ✗
Differential Privacy | Built-in | Not-Native | Built-in
Flexibility | Dynamic | Static | Dynamic
Support Multiple Federated Optimization Techniques | ✓ | ✓ | Limited

Table 8. Most common platforms for FL and their Focus

Platform | Focus
FLUTE | FL Research and Simulation
TensorFlow Federated | FL Research and Simulation
PySyft | Simulation and Privacy
LEAF | Datasets and Metrics
FedML | Production Oriented
Flower | Production Oriented

Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konečnỳ, J., Mazzocchi, S., McMahan, B., et al. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems, 1:374–388, 2019.

Byrd, D. and Polychroniadou, A. Differentially private secure multi-party computation for federated learning in financial applications, 2020.

Chen, K. and Huo, Q. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In Proc. ICASSP, March 2016.

Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.

Diao, Q., Qiu, M., Wu, C.-Y., Smola, A. J., Jiang, J., and Wang, C. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proc. of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.

Dimitriadis, D., Kumatani, K., Gmyr, R., Gaur, Y., and Eskimez, E. S. Federated Transfer Learning with Dynamic Gradient Aggregation. arXiv preprint arXiv:2008.02452, 2020a.

Dimitriadis, D., Kumatani, K., Gmyr, R., Gaur, Y., and Eskimez, S. E. A federated approach in training acoustic models. In Proceedings of Interspeech'20, 2020b.

Dimitriadis, D., Kumatani, K., Gmyr, R., Gaur, Y., and Eskimez, S. E. Dynamic Gradient Aggregation for Federated Domain Adaptation. arXiv preprint arXiv:2106.07578, 2021.

Dutta, S., Joshi, G., Ghosh, S., P., D., and P., N. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In Proc. of the 21st Intl. Conf. on Artificial Intelligence and Statistics, pp. 803–812, 2018.

Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.

Enthoven, D. and Al-Ars, Z. An overview of federated deep learning privacy attacks and defensive strategies. arXiv preprint arXiv:2004.04676, 2020.

Gabriel, E., Fagg, G. E., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R. H., Daniel, D. J., Graham, R. L., and Woodall, T. S. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, pp. 97–104, Budapest, Hungary, September 2004.

Go, A., Bhayani, R., and Huang, L. Twitter sentiment classification using distant supervision. Stanford Tech. Report, 2009.

Gopi, S., Lee, Y. T., and Wutschitz, L. Numerical composition of differential privacy. arXiv preprint arXiv:2106.02848, 2021.

Gu, X., Mao, Y., Han, J., Liu, J., Wu, Y., Yu, C., Finnie, D., Yu, H., Zhai, J., and Zukoski, N. Generating representative headlines for news stories. In Proc. of WWW Conf'20, 2020.

Hsu, Y., Liu, Y., Ramasamy, A., and Kira, Z. Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines. arXiv preprint arXiv:1810.12488v4, 2018.

Jhunjhunwala, D., Gadhikar, A., Joshi, G., and Eldar, Y. C. Adaptive quantization of model updates for communication-efficient federated learning, 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017.

Konecny, J., McMahan, B. H., and Ramage, D. Federated Optimization: Distributed Optimization Beyond the Datacenter. arXiv preprint arXiv:1511.03575v1, 2015.

LeCun, Y. and Cortes, C. MNIST handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated Learning: Challenges, Methods, and Future Directions. arXiv preprint arXiv:1908.07873v1, 2019.

Lian, X., Huang, Y., Li, Y., and Liu, J. Asynchronous parallel stochastic gradient for nonconvex optimization. In Proc. of 28th Intl. Conf. on Advances in Neural Information Processing Systems (NIPS 2015), 2015.

Liang, X., Javid, A. M., Skoglund, M., and Chatterjee, S. Asynchronous Decentralized Learning of a Neural Network. arXiv preprint arXiv:2004.05082v1, 2020.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. Communication-efficient Learning of Deep Networks from Decentralized Data. In Proc. International Conference on Artificial Intelligence and Statistics, pp. 1273–1282, 2017.

Mironov, I. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275. IEEE, 2017.

Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. LibriSpeech: an ASR corpus based on public domain audio books. In Proc. International Conference on Acoustics, Speech and Signal Processing, 2015.

Parisi, G., Kemker, R., Part, J., Kanan, C., and Wermter, S. Continual Lifelong Learning with Neural Networks: A Review. arXiv preprint arXiv:1802.07569, 2018.

Patarasuk, P. and Yuan, X. Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations. J. Parallel Distrib. Comput., 69(2):117–124, 2009.

Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konecny, J., Kumar, S., and B., M. H. Adaptive Federated Optimization. arXiv preprint arXiv:2003.00295v1, 2020.

Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 1958.

Sahoo, D., Pham, Q., Lu, J., and Ho, S. Online Deep Learning: Learning Deep Neural Networks on the Fly. arXiv preprint arXiv:1711.03705, 2017.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Sergeev, A. and Bals, M. D. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.

Shamir, O., Srebro, N., and Zhang, T. Communication Efficient Distributed Optimization Using an Approximate Newton-type Method. arXiv preprint arXiv:1312.7853, 2013.

Strom, N. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

Tang, D., Qin, B., and Liu, T. Document modeling with gated recurrent neural network for sentiment classification. In Proc. of ACL International Conference EMNLP'15, 2015.

Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. Advances in Neural Information Processing Systems, 31:1299–1309, 2018.

Wei, K., Li, J., Ding, M., Ma, C., Yang, H. H., Farokhi, F., Jin, S., Quek, T. Q. S., and Poor, H. V. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020. doi: 10.1109/TIFS.2020.2988575.

Wolford, B. A Guide to GDPR Data Privacy Requirements. https://gdpr.eu/data-privacy/, 2021.

You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks, 2017.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2020.

Zhang, J., Chen, B., Cheng, X., Binh, H. T. T., and Yu, S. PoisonGAN: Generative Poisoning Attacks Against Federated Learning in Edge Computing Systems. IEEE Internet of Things Journal, 8(5):3310–3322, 2021.

Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data, 2018.

Ziller, A., Trask, A., Lopardo, A., Szymkow, B., Wagner, B., Bluemke, E., Nounahon, J.-M., Passerat-Palmbach, J., Prakash, K., Rose, N., Ryffel, T., Reza, Z. N., and Kaissis, G. PySyft: A Library for Easy Federated Learning, pp. 111–139. Springer International Publishing, Cham, 2021. ISBN 978-3-030-70604-3.

A Adaptive Optimizers

This section can be found in (Dimitriadis et al., 2021) but is herein included for the sake of completeness.

The $j$-th client update runs $t$ iterations with $t \in [0, T_j]$, locally updating the seed model $w_T^{(s)}$ (herein shown using the SGD optimizer, without loss of generality) with a learning rate of $\eta_j$,

$w_{t+1}^{(j)} = w_t^{(j)} - \eta_j \nabla w_t^{(j)}$    (2)

where $t$ indexes the local iterations on the $j$-th client, i.e., the client time steps, and $w_t^{(j)}$ is the local model, with $w_0^{(j)} \overset{\text{def}}{=} w_T^{(s)}$.

Then, the $j$-th client returns a smooth approximation of the local gradient $\tilde{g}_T^{(j)}$ (over the $T_j$ local iterations, where $T$ is the iteration "time" on the server side) as the difference between the latest updated local model $w_{T_j}^{(j)}$ and the previous global model $w_T^{(s)}$,

$\tilde{g}_T^{(j)} = w_{T_j}^{(j)} - w_T^{(s)}$    (3)

Since estimating the gradients $g_T$ is extremely difficult, hereafter the approximation $\tilde{g}_T^{(j)}$ is used instead.

The gradient samples $\tilde{g}_T^{(j)}$ are weighted and aggregated, as described in Section 4.3,

$g_T^{(s)} = \sum_j \alpha_T^{(j)}\, \tilde{g}_T^{(j)}$    (4)

where $\alpha_T^{(j)}$ are the aggregation weights.

The global model $w_{T+1}^{(s)}$ is updated as in (4) (here also shown using SGD, although not necessary),

$w_{T+1}^{(s)} = w_T^{(s)} - \eta_s\, g_T^{(s)}$    (5)

The process described in Eq. 6 is a form of "Naive Rehearsal" (Hsu et al., 2018; Parisi et al., 2018; Sahoo et al., 2017). Updating the global model may cause drifting; in order to mitigate such drifting, an additional training step is proposed (with held-out data matching the tasks in question) on the server side, i.e., the model updates are regularized in a direction matching the held-out data. As such, the model does not diverge too much from the task of interest.

$w_{\tilde{t}+1}^{(s)} = w_{\tilde{t}}^{(s)} - \eta_w \nabla w_{\tilde{t}}^{(s)}$    (6)

Algorithm 2 Adaptive Federated Optimization
1: Input: ($w_0^{(s)}$, $x_T^{(j)}$)
2: while model $w_T^{(s)}$ has not converged do
3:   for j in [0, N] do
4:     Send seed model $w_T^{(s)}$ to the j-th client
5:     Train local model $w^{(j)}$ on the j-th node with data $x_T^{(j)}$
6:     Estimate a smooth approximation of the local gradients $\tilde{g}_t^{(j)}$ for the j-th node
7:     j ← j + 1
8:   end for
9:   Estimate weights $\alpha_T^{(j)}$
10:  Aggregate the weighted sum of gradients, Equation 4
11:  Update global model $w_T^{(s)}$ using the aggregated gradient, Equation 5
12:  Update global model on held-out data, Equation 6
13: end while

The convergence speed of training due to the hierarchical optimization scheme is improved by a factor of 2×, without any negative impact on performance. Also, the communication overhead is significantly lower, instead of continuously transmitting gradients as in FedSGD (McMahan et al., 2017).

B Stale Gradient Analysis

A complementary approach to deal with the issue of straggling is to use asynchronous SGD. In asynchronous SGD, any learner can evaluate the gradient and update the central PS without waiting for the other learners. Asynchronous variants of existing SGD algorithms have also been proposed and implemented in systems, e.g., (Agarwal & Duchi, 2011; Dutta et al., 2018). In general, analyzing the convergence of asynchronous SGD with the number of iterations is difficult in itself because of the randomness of gradient staleness.

Gradient descent is a way to iteratively minimize this objective function by updating the parameter $w$ based on the gradient of the model $\theta_\tau^{(s)}$ at every iteration $\tau$, as given by

$\theta_{t+1}^{(j)} = \theta_t^{(j)} - \eta^{(j)} \nabla_\theta L_\theta(x_i^{(j)})$    (7)

for the $j$-th client, over the local data mini-batches $x_i^{(j)}$. As described in Section 4.3, the clients estimate a pseudo-gradient $\tilde{g}_{T_j+\tau}^{(j)}$ at the end of their training cycle,

$\tilde{g}_{T_j+\tau}^{(j)} = \theta_{T_j}^{(j)} - \theta_\tau^{(s)}$    (8)

where $T_j$ is the time it took for client $j$ to estimate the final local model, and $\theta_\tau^{(s)}$ is the global/initial model communicated to the client at time $\tau$. As in (Dimitriadis et al., 2020a), these pseudo-gradients are weighted and aggregated

$\theta_{\tau+1}^{(s)} = \theta_\tau^{(s)} - \eta^{(s)} \sum_{j \in N} I_{j,\tau}\, \alpha_j\, \tilde{g}_{T_j+\tau}^{(j)}$    (9)

where $N$ is the number of clients per iteration $\tau$, and the indicator

$I_{j,\tau} = \begin{cases} 1 & \text{if } T_j + \tau \in W_{[\tau,\tau+1)} \\ 0 & \text{else} \end{cases}$

with $\tilde{I}_{j,\tau} = 1 - I_{j,\tau}$.

There are different degrees of staleness and, for this work, the stale gradients are considered to fall at most one iteration behind, i.e., some of the gradients $\tilde{g}_{T_j+\tau-1}^{(j)}$ are part of the aggregation step in Eq. 9 for the window $W_{[\tau,\tau+1)}$. In other words, Eq. 9 now becomes

$\theta_{\tau+1}^{(s)} = \theta_\tau^{(s)} - \eta^{(s)} \left[ \sum_{j \in J} \alpha_j \left(\theta_{T_j}^{(j)} - \theta_\tau^{(s)}\right) + \sum_{i \in I} \alpha_i \left(\theta_{T_i}^{(i)} - \theta_{\tau-1}^{(s)}\right) \right]$    (10)

where $J$, $I$ are the indices of nodes without/with stale gradients, assuming that $J \cup I = N$, i.e., the union of clients with current and stale gradients covers the client space per iteration. Assuming that the final models $\theta_{T_j}$ per client would reach a similar point regardless of the starting model $\theta_\tau^{(s)}$ (a realistic assumption in convex models),

$\theta_{\tau+1}^{(s)} \approx \theta_\tau^{(s)} - \eta^{(s)} \left[ \sum_{n \in N} \alpha_n \left(\theta_{T_n}^{(n)} - \theta_\tau^{(s)}\right) + \sum_{i \in I} \alpha_i \left(\theta_\tau^{(s)} - \theta_{\tau-1}^{(s)}\right) \right]$    (11)

Based on Eqs. 9 and 11, the stale gradients of the $I$ nodes introduce an error term $E_\tau$ which depends only on the weights $\alpha_i$ and the difference with the previous model, i.e., the aggregated gradients of the previous time step,

$E_\tau = \eta^{(s)} \left(\theta_\tau^{(s)} - \theta_{\tau-1}^{(s)}\right) \sum_{i \in I} \alpha_i = \eta^{(s)} \left(\theta_\tau^{(s)} - \theta_{\tau-1}^{(s)}\right) \sum_{i \in I} \tilde{I}_{i,\tau}\, \alpha_i$

The expectation of the L2-norm of the error is

$\mathbb{E}\left[\|E_\tau\|_2\right] = \mathbb{E}\left[\left\| \eta^{(s)} \left(\theta_\tau^{(s)} - \theta_{\tau-1}^{(s)}\right) \sum_{i \in I} \tilde{I}_{i,\tau}\, \alpha_i \right\|_2\right] \le \eta^{(s)}\, \mathbb{E}\left[\left\|\theta_\tau^{(s)} - \theta_{\tau-1}^{(s)}\right\|_2\right]$    (12)

since $\sum \tilde{I}_{i,\tau}\, \alpha_i \le 1$. According to Eq. 12, the upper bound of the error term due to the stale gradients is the norm of the model differences between updates, weighted by the learning rate $\eta^{(s)}$. In other words, the expectation of the norm of the error due to stale gradients is bounded by the model updates (at fixed points in time). If we call $\Delta_\tau$ the norm of the difference between sequential-in-time models,

$\Delta_\tau = \left\|\theta_\tau^{(s)} - \theta_{\tau-1}^{(s)}\right\|_2$    (13)

$\Delta_\tau$ becomes smaller since the models converge to an optimal point. As such, $\lim_{\tau \to \infty} \Delta_\tau = 0$ and, from Eq. 12, the error due to stale gradients becomes $\lim_{\tau \to \infty} E_\tau = 0$.

The conclusion from Eqs. 12 and 13 is in accordance with the analysis in (Lian et al., 2015), where it is shown that the convergence rate does not depend on the staleness ratio given a sufficient number of iterations. It is proved that the benefits of not waiting for the straggler nodes (thus producing stale gradients), in terms of time needed to converge, counter-balance the errors introduced early in the training process. Also, based on the analysis in (Dutta et al., 2018), adjusting the learning rate schedule per iteration $\tau$ based on the staleness $\Delta_\tau$ can further expedite convergence,

$\eta_\tau^{(s)} = \min\left(\frac{C}{\Delta_\tau},\, \eta_{\max}\right)$    (14)

where $C$ is a predefined constant related to the error floor.