Exploring The Impact of Attacks On Ring AllReduce
(a) Each worker only receives parameters from the fixed worker and sends parameters to another fixed worker. (b) When there is a malicious worker, its tampered parameters will be sent to all the other workers after N − 1 steps.
Figure 2: The basic architecture of Ring AllReduce.
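To make the communication pattern in Figure 2 concrete, the following is a minimal NumPy sketch of the Scatter-Reduce and Allgather phases (an illustration, not the authors' implementation): it assumes gradients are flat arrays, the reduce operation is a sum, and the gradient length splits evenly into N chunks. Each worker only exchanges data with its two fixed neighbours, yet after the two phases every worker holds the same aggregated gradient.

```python
import numpy as np

def ring_allreduce(local_grads):
    """Simulate Ring AllReduce over a list of per-worker gradient vectors."""
    n = len(local_grads)
    # Each worker starts with its own gradient, split into N chunks.
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in local_grads]

    # Scatter-Reduce: for N - 1 steps, worker i sends chunk (i - step) mod N
    # to worker (i + 1) mod N, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        outgoing = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            src = (i - 1) % n                 # left neighbour
            c = (src - step) % n              # chunk index the neighbour sends
            chunks[i][c] = chunks[i][c] + outgoing[src]

    # Worker i now holds the fully reduced chunk (i + 1) mod N.
    # Allgather: for N - 1 steps, worker i forwards chunk (i + 1 - step) mod N
    # to worker (i + 1) mod N, which overwrites its own copy.
    for step in range(n - 1):
        outgoing = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            src = (i - 1) % n
            c = (src + 1 - step) % n
            chunks[i][c] = outgoing[src]

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    grads = [np.full(8, i, dtype=float) for i in range(4)]   # 4 workers, toy gradients
    results = ring_allreduce(grads)
    assert all(np.allclose(r, sum(grads)) for r in results)  # everyone gets the same sum
```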
2.2 System Attack and Environment Design

Based on the aforementioned training process, it is easy to see that the local gradients are spread to all of the workers. Thus, we infer that a malicious worker will also affect the performance of Ring AllReduce. To verify this, we use the following attacks:

Gradient flipping attack: The malicious workers calculate the local gradients based on the local training data, and multiply the local gradients by -1 as the gradients transmitted between workers. In other words, the attack changes the sign of the parameters.
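As a rough, self-contained sketch (again assuming the reduce operation is a sum over flat gradient vectors; the worker count, gradient size, and random draws are placeholders), a single gradient-flipping worker shifts the aggregate that every worker receives by twice its own gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers = 8
honest_grads = [rng.normal(size=16) for _ in range(n_workers - 1)]
attacker_local = rng.normal(size=16)

# Gradient flipping attack: the attacker computes its real local gradient
# but transmits it multiplied by -1.
transmitted = honest_grads + [-1.0 * attacker_local]

# Ring AllReduce delivers the same aggregate to every worker, so the flipped
# gradient enters each worker's model update.
aggregate = np.sum(transmitted, axis=0)
clean_aggregate = np.sum(honest_grads + [attacker_local], axis=0)
print(np.linalg.norm(aggregate - clean_aggregate))  # equals 2 * ||attacker_local||
```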
Gradient ascent attack: The malicious workers calculate the local gradients based on the local training data and multiply the local gradients by -4 as the gradients transmitted between workers. Compared with the gradient flipping attack, it is a more powerful attack.

Random value attack: The malicious workers do not need to calculate the gradients. Instead, they replace each parameter with a random number; each number follows a normal distribution and is multiplied by a coefficient. In our experiment, the coefficient is set to 0.3.
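A minimal sketch of how these two attackers could form the values they transmit (the factor -4 and the coefficient 0.3 follow the text above; the use of the standard normal distribution, the gradient shape, and the helper names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_ascent_attack(local_grad, factor=-4.0):
    # Transmit the true local gradient scaled by -4: a stronger push in the
    # wrong direction than simply flipping the sign.
    return factor * local_grad

def random_value_attack(shape, coeff=0.3):
    # No gradient computation at all: transmit normally distributed noise
    # scaled by the coefficient 0.3.
    return coeff * rng.standard_normal(shape)

local_grad = rng.normal(size=16)              # placeholder local gradient
sent_ascent = gradient_ascent_attack(local_grad)
sent_random = random_value_attack(local_grad.shape)
```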
Same value attack: The malicious workers do not need to calculate the gradients. Instead, they replace all the parameters with the same value. In our experiment, we set it to 100.

Label flipping attack: For the training data, if the real label is x, we set the fake label to 9 − x. The malicious workers then calculate gradients based on the fake labels and transmit the calculated gradients to the neighboring worker. In the Scatter-Reduce process, after N − 1 steps, the tampered parameters will be sent to all the other workers, which can influence the global parameters in each worker. The Allgather process further uses the tampered global parameters to update the global model. As a result, the global model can be influenced.
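The remaining two attacks can be sketched in the same style (the constant 100 and the 9 − x rule follow the text; the ten-class label range matches CIFAR10, while the function names and shapes are illustrative placeholders rather than the paper's training code):

```python
import numpy as np

def same_value_attack(shape, value=100.0):
    # No gradient computation: every transmitted parameter is the constant 100.
    return np.full(shape, value)

def flip_labels(labels, num_classes=10):
    # Label flipping for a 10-class dataset such as CIFAR10: a real label x
    # becomes the fake label 9 - x; gradients are then computed on fake labels.
    return (num_classes - 1) - labels

print(flip_labels(np.array([0, 3, 7, 9])))   # -> [9 6 2 0]
sent_same = same_value_attack((16,))         # placeholder parameter shape
```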
3 EXPERIMENT RESULT AND ANALYSIS

In our evaluation, we use CIFAR10 as the training dataset and CNN and Alexnet as the training models. For the CNN model, there are two convolution layers followed by two fully connected layers. We use 8 and 16 workers for the experiments and perform the experiments with one attacker.

We test 6 settings, namely no attacker and the 5 aforementioned attacks, on both CNN and Alexnet. The experimental results are shown in Figures 3 and 4. It should be noted that we stop the training when the model cannot converge (e.g., the loss for Alexnet in Figure 3 (b)). The results show that an attack on Ring AllReduce leads to a decrease in the training accuracy; in the worst case, it makes the training model diverge. For the same attack, the effects on different training models are distinct, which reveals that the architecture of the model leads to different abilities to resist attacks. In the experiment, when there are 16 workers and one attacker, the effect of the attack is weaker than with 8 workers, because the single attacker's contribution is averaged with those of more benign workers.

Figure 3: The results of the experiments with 8 workers. (a) The training accuracy of Alexnet. (b) The training loss of Alexnet. (c) The training accuracy of CNN. (d) The training loss of CNN. Each panel plots testing accuracy or training loss against iterations for the six settings (no attacker and the five attacks).

Figure 4: The results of the experiments with 16 workers. (a) The training accuracy of Alexnet. (b) The training loss of Alexnet. (c) The training accuracy of CNN. (d) The training loss of CNN.

4 CONCLUSION AND FUTURE WORK

In this paper, we explore the security problem in the Ring AllReduce architecture, and find that malicious workers indeed affect the model's performance. In the future, we will evaluate the impact of existing attack methods on complex training models, datasets, and other processes in Ring AllReduce.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China under Grants 62002019 and 62002066, Beijing Institute of Technology Research Fund Program for Young Scholars, Fundamental Research Funds for the Central Universities under Grant 1301032207, and the project "PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (LZC0019)".