Exploring the Impact of Attacks on Ring AllReduce

Jiayu Wang, Peng Liu, Zehua Guo∗
Beijing Institute of Technology
China

Sen Liu
Peng Cheng Lab, Fudan University
China

Chao Yao
Shaanxi Normal University
China

∗ Zehua Guo is the corresponding author.
ABSTRACT
Distributed Machine Learning (DML) is widely used to accelerate the training of deep learning models. In DML, Parameter Server (PS) and Ring AllReduce are two typical architectures. Recently, many works have addressed the security problem in PS, whose performance can be greatly degraded by malicious participation during the training process. However, the robustness of Ring AllReduce, which solves the communication bandwidth problem of PS, against malicious participants is still unknown. In this paper, we design a series of experiments to explore the security problem in Ring AllReduce and reveal that it can also suffer from malicious participants.

ACM Reference Format:
Jiayu Wang, Peng Liu, Zehua Guo, Sen Liu, and Chao Yao. 2021. Exploring the Impact of Attacks on Ring AllReduce. In 5th Asia-Pacific Workshop on Networking (APNet 2021), June 24–25, 2021, Shenzhen, China. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3469393.3469676

1 INTRODUCTION
Deep learning has been successfully applied to many applications in recent years. However, the high computation and memory cost of deep learning still hinders its deployment. To solve this problem, Distributed Machine Learning (DML) has been proposed, which uses many machines simultaneously. Based on their topological architectures, Parameter Server (PS) and Ring AllReduce are two typical DML methods. In PS, a central server is used to coordinate the training of the different participants (usually named workers in DML). The server first sends a model to each worker; each worker then computes a gradient/parameter based on its local data and sends it back to the server. The server finally updates the model with the gradients/parameters collected from the workers. This process is repeated many times, so the communication bandwidth becomes a heavy burden for the central server. Ring AllReduce [3], which deploys the workers in a ring topology, is proposed to solve this problem. In Ring AllReduce, each worker only needs to receive parameters from one worker and send parameters to another worker. Consequently, the communication bandwidth of Ring AllReduce is greatly reduced.
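To make the PS workflow above concrete, the following is a minimal, illustrative sketch of one PS iteration with plain gradient averaging. This is a NumPy simulation of ours, not code from the paper; the worker count, model size, learning rate, and the local_gradient placeholder are all assumptions.

import numpy as np

def local_gradient(model, data):
    # Placeholder for a worker's gradient computation on its local data.
    # A real worker would run forward/backward passes on its data shard.
    return np.random.randn(*model.shape)

def ps_iteration(model, worker_datasets, lr=0.01):
    # 1. The server sends the current model to every worker.
    # 2. Each worker computes a gradient on its local data and returns it.
    grads = [local_gradient(model, data) for data in worker_datasets]
    # 3. The server averages the collected gradients and updates the model.
    return model - lr * np.mean(grads, axis=0)

model = np.zeros(10)          # toy model parameters
shards = [None] * 4           # one placeholder data shard per worker
model = ps_iteration(model, shards)

Every worker's gradient flows through the single server in step 2, which is exactly the bandwidth bottleneck that Ring AllReduce removes.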
Training a distributed machine learning model is not safe under cyberattacks. Malicious workers, which are controlled by an attacker, can send wrong gradients or parameters to the server. An example is presented in Figure 1. In the example, the gradients estimated by the correct workers point in similar directions, while the one from the malicious worker can differ substantially from the correct ones. In this way, a wrong estimate of the global gradient is produced, leading to degraded performance or divergence of the model. Based on the scheme in Figure 1, different attacks have been proposed recently [2]. Many works have also been proposed to protect PS from such attacks: [1] kicks out the malicious workers, and [4] aggregates the workers' gradients robustly, rather than simply averaging them, to obtain a robust global gradient.

However, existing works mainly focus on the security problem in PS. To the best of our knowledge, the impact of malicious workers on the performance of Ring AllReduce is still unknown. Ring AllReduce does not have a central server, and gradients are processed as they arrive at the workers instead of being aggregated centrally at a server. Thus, none of the workers can act from a global perspective, and existing defense methods for PS cannot be used directly. In this paper, we address the security problem in Ring AllReduce. We design several experiments to test the influence of malicious workers when training DML models with Ring AllReduce. The experimental results show that malicious workers can reduce the accuracy of the models and, in the worst case, can even make the models diverge.

[Figure 1: The calculation of the gradients with malicious workers can be highly different from that without malicious workers, which can hurt the training process. (a) The calculation of the gradients without malicious workers. (b) The calculation of the gradients with malicious workers.]
2 EXPERIMENT DESIGN
2.1 Overview of Ring AllReduce
Training with the Ring AllReduce architecture consists of many iterations. In each iteration, each worker first calculates the local gradients based on its local data. Then, the model is updated by two processes, Scatter-Reduce and Allgather.

In Scatter-Reduce, each worker sends gradients to one fixed adjacent worker and receives gradients from the other fixed adjacent worker. After N − 1 steps (where N is the total number of workers), each worker holds part of the globally updated model. Then Allgather is adopted to update the global model: after another N − 1 steps, each worker has the updated global model. Figure 2 shows the scheme of one iteration in Ring AllReduce.
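The following is a minimal in-memory simulation of this procedure (our own illustrative sketch, not the paper's implementation); it follows the standard ring schedule in which worker i forwards chunk (i - step) mod N during Scatter-Reduce.

import numpy as np

def ring_allreduce(local_grads):
    # Simulated Ring AllReduce over a list of per-worker gradient vectors.
    # Each gradient is split into N chunks. Scatter-Reduce: in each of the
    # N - 1 steps, every worker passes one chunk to its next neighbour and
    # adds the chunk received from its previous neighbour, so that each
    # worker ends up owning one fully reduced chunk. Allgather then
    # circulates those reduced chunks for another N - 1 steps.
    n = len(local_grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n))
              for g in local_grads]

    # Scatter-Reduce: worker i sends chunk (i - step) % n at each step.
    for step in range(n - 1):
        received = [chunks[(i - 1) % n][(i - 1 - step) % n] for i in range(n)]
        for i in range(n):
            idx = (i - 1 - step) % n
            chunks[i][idx] = chunks[i][idx] + received[i]

    # Allgather: worker i forwards its completed chunk to the next worker.
    for step in range(n - 1):
        received = [chunks[(i - 1) % n][(i - step) % n] for i in range(n)]
        for i in range(n):
            chunks[i][(i - step) % n] = received[i]

    return [np.concatenate(c) for c in chunks]

# Toy check: every worker ends up with the element-wise sum of all gradients.
grads = [np.arange(6) * (w + 1) for w in range(4)]
result = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in result)

After the two phases, every worker holds the sum of all local gradients, which it can then average and apply to its own model copy.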

[Figure 2: The basic architecture of Ring AllReduce. (a) Each worker only receives parameters from a fixed worker and sends parameters to another fixed worker. (b) When there is a malicious worker, its tampered parameters will be sent to all the other workers after N − 1 steps.]
2.2 System Attack and Environment Design
Based on the aforementioned training process, it is easy to see that the local gradients are spread to all of the workers. Thus, we infer that a malicious worker will also affect the performance of Ring AllReduce. To verify this, we use the following attacks (a code sketch of these manipulations is given after the list):

Gradient flipping attack: The malicious workers calculate the local gradients based on their local training data and multiply the local gradients by -1 before transmitting them to other workers. In other words, the attack flips the sign of the transmitted gradients.
Gradient ascent attack: The malicious workers calculate the local gradients based on their local training data and multiply the local gradients by -4 before transmitting them. Compared with the gradient flipping attack, it is a more powerful attack.

Random value attack: The malicious workers do not need to calculate the gradients. Instead, they replace each parameter with a random number drawn from the normal distribution and multiplied by a coefficient. In our experiment, the coefficient is set to 0.3.
Same value attack: The malicious workers do not need to calculate the gradients. Instead, they replace all the parameters with the same value. In our experiment, we set it to 100.

Label flipping attack: For the training data, if the real label is x, we set the fake label to 9 − x. The malicious workers then calculate gradients based on the fake labels and send the calculated gradients to the other workers.
In the Scatter-Reduce process, after N − 1 steps, the tampered parameters are spread to all the other workers, which influences the global parameters held by each worker. The Allgather process then further uses the tampered global parameters to update the global model. As a result, the global model can be corrupted.
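As a rough sketch (assuming gradients are NumPy arrays and CIFAR-10 labels are integers in 0..9; the function names are ours), the five attacks amount to the following transformations applied by a malicious worker before it contributes to the ring.

import numpy as np

rng = np.random.default_rng(0)

def gradient_flipping(grad):
    # Flip the sign of the honestly computed local gradient.
    return -1.0 * grad

def gradient_ascent(grad):
    # A stronger variant: scale the local gradient by -4.
    return -4.0 * grad

def random_value(grad, coeff=0.3):
    # Ignore the real gradient; send normally distributed noise scaled by 0.3.
    return coeff * rng.standard_normal(grad.shape)

def same_value(grad, value=100.0):
    # Ignore the real gradient; send the same constant for every parameter.
    return np.full_like(grad, value)

def flip_labels(labels, num_classes=10):
    # Label flipping for CIFAR-10: a sample with label x is relabelled 9 - x;
    # the malicious worker then computes its gradient on these fake labels.
    return (num_classes - 1) - labels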
3 EXPERIMENT RESULT AND ANALYSIS
In our evaluation, we use CIFAR10 as the training dataset, and CNN and Alexnet as the training models. For the CNN model, there are two convolution layers followed by two fully connected layers. We use 8 and 16 workers for the experiments and perform the experiments with one attacker.
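For reference, a minimal PyTorch sketch of a CNN with this structure is shown below; the channel counts, kernel sizes, and pooling are assumptions of ours, since the paper only specifies two convolution layers followed by two fully connected layers.

import torch.nn as nn

# A CNN for 32x32 CIFAR-10 images: two convolution layers followed by two
# fully connected layers. Layer widths and kernel sizes are illustrative
# assumptions, not details taken from the paper.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),
)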
We test 6 settings, i.e., no attacker and the 5 aforementioned attacks, on both CNN and Alexnet. The experimental results are shown in Figures 3 and 4. Note that we stop the training when the model cannot converge (e.g., the loss for Alexnet in Figure 3(b)). The results show that the attacks on Ring AllReduce reduce the training accuracy and, in the worst case, make the training model diverge. For the same attack, the effects on different training models are distinct, which reveals that the architecture of the model leads to different abilities to resist attacks. In the experiment, when there are 16 workers and one attacker, the effect of the attack is weaker than with 8 workers, because the proportion of attackers among the participants is lower in this situation. In fact, if there were more attackers, the decrease of the training accuracy and the divergence of the training model would be more pronounced.

[Figure 3: The results of the experiments with 8 workers. (a) The training accuracy of Alexnet. (b) The training loss of Alexnet. (c) The training accuracy of CNN. (d) The training loss of CNN.]

[Figure 4: The results of the experiments with 16 workers. (a) The training accuracy of Alexnet. (b) The training loss of Alexnet. (c) The training accuracy of CNN. (d) The training loss of CNN.]

4 CONCLUSION AND FUTURE WORK
In this paper, we explore the security problem in the Ring AllReduce architecture and find that malicious workers indeed affect the model's performance. In the future, we will evaluate the impact of existing attack methods on complex training models, datasets, and other processes in Ring AllReduce.

ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China under Grants 62002019 and 62002066, the Beijing Institute of Technology Research Fund Program for Young Scholars, the Fundamental Research Funds for the Central Universities under Grant 1301032207, and the project "PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (LZC0019)".

REFERENCES
[1] Peva Blanchard et al. 2017. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS'17.
[2] David Enthoven and Zaid Al-Ars. 2020. An overview of federated deep learning privacy attacks and defensive strategies. arXiv preprint arXiv:2004.04676 (2020).
[3] Pitch Patarasuk and Xin Yuan. 2009. Bandwidth optimal all-reduce algorithms for clusters of workstations. Elsevier JPDC 69, 2 (2009), 117–124.
[4] Dong Yin et al. 2018. Byzantine-robust distributed learning: Towards optimal statistical rates. In ICML'18.
