2022 FedRL ICPP2022
All content following this page was uploaded by Phi Le Nguyen on 06 March 2023.
but there are also those that are not. For more examples, see Section 2.2. To fill this gap, in this work, besides considering common types of non-IID data as in existing studies (we name them as i.e., label skew non-IID), we introduce a new type of non-IID data that exhibits a non-independent property of data across clients. Specifically, we
The remainder of the paper is organized as follows. Section 2 provides a short summary of standard federated learning and the non-IID challenges. We then discuss our motivational study and present our proposal in Section 3. The experimental results are described in Sections 4 and 5. We finally conclude in Section 6.
2 PRELIMINARIES
2.1 Federated Learning and FedAvg
Federated Learning (FL) is a method that involves a large number of $K$ clients, usually hundreds to thousands, training locally and updating the Deep Learning (DL) model globally by using a weight aggregation method at a centralized server. FL finds a DL model $w^*$ that minimizes the objective function $f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w)$, where $p_k = \frac{n_k}{n}$ denotes the relative sample size ($n = \sum_{k=1}^{K} n_k$), and $F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{D}_k} l_i(w)$ is the local objective function of client $k$. Here, $l_i(w)$ is the corresponding loss function for sample $i$ and $\mathcal{D}_k$ denotes the local dataset of client $k$ ($n_k = |\mathcal{D}_k|$).
Federated Averaging (FedAvg) [17] was the first, and has become one of the most commonly used, FL methods for aggregating the locally trained models of the clients into the shared global model. This method randomly selects a subset of $K$ clients in each communication round. Each client then performs local training for $E$ epochs on its local dataset and uploads its locally trained model to the server synchronously. Upon receiving the local models from the clients, the server then performs weighted aggregation as follows.
$$n \leftarrow \sum_{k=1}^{K} n_k; \qquad w^{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_k^t. \tag{1}$$
Such a weight aggregation strategy may lead to an unfair result when the data distribution is imbalanced among clients (non-IID data) [8].
2.2 Federated Learning with non-IID data
2.2.1 Categories of non-IID Data. In this work, we follow the taxonomy of non-IID (non independent and identical) data presented in [6] (Appendix-K) and [31]. Non-IID data can be categorized by how the data fail to be distributed identically across clients, including:
Feature skew or attribute skew: clients have the same label sets² but the features of samples differ between clients [31]. For example, in cancer diagnosis tasks, the cancer types are similar across hospitals but the radiology images can vary depending on the different imaging protocols or scanner generations used in hospitals [14].
Quantity skew: different clients hold different amounts of data. For example, the younger generation may use mobile devices more frequently than the older generation, thus providing more local data to train the face recognition DL model [23].
Label skew: represents the scenarios where the label distribution differs between clients. In the label size imbalance non-IID proposed by [17], each client has a fixed number of $c$ label classes; $c$ determines the degree of label imbalance. Recent works [8, 13, 22, 24] distribute the samples of a particular label to the clients following a particular distribution such as the power-law or Dirichlet distribution. This kind of scenario is closer to real-world data. For example, a car and a table can appear in both home cameras and street cameras. However, cars commonly appear on street cameras while tables commonly appear on home cameras [16].
This work focuses on label skew, which is common when data is generated by heterogeneous users. Unlike previous research, we also consider data correlation between clients. This kind of correlation has been observed in [6] with their real-world dataset, i.e., the Flickr-Mammal dataset [7], and can also be observed in the number of cancer incidence cases in the USA from 2014 to 2018 for cancer diagnosis tasks [5]. In these datasets, clients who share the same feature, such as geographic region, timeline, or age, may have the same data distribution. First, the global distribution of data labels is not uniform, e.g., the most popular mammal (cat) has 23× more images than the less popular one (skunk) [6]. Similarly, we found that the number of incidence cases in the digestive system (the most popular) is 98× that of the incidence cases in eye and orbit (the less popular) [5]. Second, we observe that data labels are frequently partitioned into clusters, and a group of clients owns data labels that belong to the same cluster. For example, users from Oceania usually share images of kangaroos/koalas on Flickr, while users from Africa share zebras/antelopes [6]. Similarly, people under the age of 20 frequently get leukemia (28%), while people between the ages of 20 and 34 frequently get cancer in the endocrine system (21%) [5]. In this paper, we name this phenomenon cluster skew.
2.2.2 Strategies for dealing with non-IID. In this study, we focus on obtaining a DL model that (1) minimizes the global objective function over the concatenation of all the local data and (2) balances the local objective functions of all the clients. This target is also referred to as the "fairness" issue among clients [23, 25]. There could be a different approach for each category of non-IID discussed above. Most of the existing work proposed to solve the statistical heterogeneity problems caused by non-IID datasets focuses on label skew and quantity skew non-IID [2, 8, 12, 13, 17, 21–25, 27, 29]. A few works target feature skew non-IID [14]. Kevin et al. [6] highlight the issue that we name cluster skew when building their real-world dataset, yet focus on solving the communication overhead problem instead of the statistical heterogeneity problem.
2.3 Basic of Reinforcement Learning
A Reinforcement Learning (RL) model essentially comprises four components: the agent; the environment, which the agent interacts with and exploits; the reward signal, which indicates how well the agent behaves regarding a specific target; and the policy, which corresponds to the way the agent behaves. The ultimate goal of the agent is to maximize the total reward signal:
$$G = \sum_{t=1}^{T} \gamma^{t-1} r_t, \tag{2}$$
where $\gamma$ is the discount factor and $r_t$ is the immediate reward signal at time step $t$.
There are two main approaches to an RL problem, namely value-based learning and policy-based learning. Value-based methods focus on learning the value of states $V(s): \mathcal{S} \to \mathbb{R}$ or of state-action pairs $Q(s, a): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and the action space, respectively. The state value $V(s)$ depicts how good state $s$ is in terms of maximizing the total reward. The state-action value $Q(s, a)$ indicates how beneficial it is to perform action
² We use the terms labels and classes interchangeably in this work.
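As a small worked illustration of the discounted return in Eq. (2), the sketch below sums a list of per-step rewards with the discount factor applied. The function name `discounted_return` and the toy reward values are hypothetical and chosen only for this example; they do not come from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute G = sum_{t=1}^{T} gamma^(t-1) * r_t, as in Eq. (2)."""
    total = 0.0
    for t, r_t in enumerate(rewards, start=1):
        # The first reward (t = 1) is not discounted: gamma^0 = 1.
        total += (gamma ** (t - 1)) * r_t
    return total


# Hypothetical three-step episode with a reward of 1.0 at every step.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With these toy values the three terms are 1, 0.5, and 0.25, so later rewards contribute progressively less to the total.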
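The FedAvg aggregation of Eq. (1) in Section 2.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: each client model is assumed to be flattened into a plain list of floats, and the name `fedavg_aggregate` and the toy values are hypothetical.

```python
from typing import List, Sequence


def fedavg_aggregate(client_weights: Sequence[Sequence[float]],
                     client_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted average of flattened client models, as in Eq. (1)."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need exactly one sample count per client model")
    n = sum(client_sizes)  # n <- sum_k n_k
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for w_k, n_k in zip(client_weights, client_sizes):
        p_k = n_k / n  # relative sample size p_k = n_k / n
        for j in range(dim):
            aggregated[j] += p_k * w_k[j]
    return aggregated


# Hypothetical toy values: two clients holding 1 and 3 samples.
global_w = fedavg_aggregate([[0.0, 4.0], [4.0, 0.0]], [1, 3])
```

In this toy case the client with three times as many samples contributes three quarters of the result, which is the kind of dominance the fairness remark in Section 2.1 refers to.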
ICPP ’22, August 29-September 1, 2022, Bordeaux, France Nang Hung Nguyen et al.
[Figure: system overview diagram. Recoverable labels: SERVER; CLIENTS; (1) Sending the global model; (3) Sending the local models and other information; (4) DRL-based impact factor determination; (5) Model aggregation; Impact factors.]
better knowledge should be given a higher impact factor. Meanwhile, in terms of the latter, knowledge that is repeatedly gained by multiple clients should be allocated a lower impact factor to avoid overfitting the model to specific knowledge. The most basic aggregation mechanism is proposed by FedAvg [17], where impact
[Figure 3: (a) Details of the DRL-based impact factor determination. (b) The proposed two-stage training strategy; recoverable annotation: "ii) Offline training phase which utilizes the collected experience to boost the performance of the networks". (c) Structures of the policy and value networks.]
the gradients or weights of the DL model, as follows:
$$\mathbf{w}^{t+1} = \mathbf{W}^t \cdot \vec{\alpha}^t = \sum_{k \in \mathcal{K}} \alpha_k^t \cdot w_k^t, \tag{4}$$
where $\mathcal{K}$ depicts the set of participating clients, $w_k^t$ represents the $k$-th client's model at round $t$, and $\mathbf{w}^{t+1}$ denotes the global model at round $t+1$. Since we assume only a fraction of the clients participates in each round, the number of involved clients at every round, i.e., $|\mathcal{K}|$, is less than the total number of clients, i.e., $N$.
3.3.2 State. The state should provide any information that assists the decision-making process, i.e., determining the impact factors used to aggregate the clients' local models in our scenario. The ultimate aim of the aggregation process at the server is to exploit meaningful information learned by the clients to create a global model that works well with all data. This implies that the state must contain information that allows the agent to determine: 1) whether the knowledge gained by each client is valuable and 2) how this information should be aggregated to produce a global model that works effectively with all clients without bias toward any particular client. As a result, the state should express the quality of the models locally trained by clients and the relationship between the clients' data. To this end, we design the state to include the following information:
• The losses of the global model on each client's dataset. These parameters are measured when the clients obtain the global model from the server (at the beginning of each communication round). It is worth noting that the models for all clients are identical at this time. Therefore, the losses reflect the correlation between the clients' data. At the same time, these losses also depict how good the global model is for each client's data.
• The losses of local clients after performing local training at the end of each communication round. This parameter is calculated before the clients send their locally trained models to the server. This information tells us how good the model trained by each client is.
• The amount of data held by each client. Clients with different numbers of training samples will, obviously, produce trained models of varying quality. Models trained on small datasets, in particular, are frequently less informative than models trained on large datasets.
In summary, the state $s_t$ is a tuple of $3 \times K$ parameters, $s_t = \{l_{1,b}^t, ..., l_{K,b}^t, l_{1,a}^t, ..., l_{K,a}^t, n_1, ..., n_K\}$, where the first $K$ parameters are the losses evaluated right after the clients retrieve the global model from the server; $l_{1,a}^t, ..., l_{K,a}^t$ represent the clients' losses evaluated when they finish the local training process at round $t$; and the final $K$ parameters $n_1, ..., n_K$ depict the number of samples of the clients.
3.3.3 Action. The agent aims to identify the clients' impact factors, which are continuous numbers. To this end, we define an action by a set of $K$ Gaussian distributions. Specifically, action $a_t$ is a tuple of $2 \times K$ values comprising the means and variances of $K$ Gaussian distributions, i.e., $a_t = \{\mu_1^t, ..., \mu_K^t, \sigma_1^t, ..., \sigma_K^t\}$. Given action $a_t$, the agent then determines the impact factor for the $k$-th client as follows:
$$\alpha_k^t = \mathrm{Softmax}(\mathcal{N}(\mu_k^t, \sigma_k^t)). \tag{5}$$
In addition, to enhance the model's stability throughout training, we add the following constraint:
$$\vec{\sigma}^t \leq \beta \vec{\mu}^t, \tag{6}$$
where $\beta$ is a hyperparameter ranging from 0 to 1.
3.3.4 Reward. The design of the reward function is critical to the efficacy of the RL model since it directs the agent's behaviors. We intend to create a global model that works well for all data, i.e., it should be suitable for all clients' data and not biased towards a specific client. As a result, the reward function has two objectives: 1) improving the global model's accuracy across all clients, and 2) balancing the global model's performance over all clients' datasets. We quantify the first goal by reducing the global model's average loss across all client datasets. We achieve the second goal by minimizing the gap between the global model's maximum and minimum losses on the clients' datasets. Accordingly, we design the reward as follows:
$$r_t = \left( \frac{1}{K} \sum_{k=1}^{K} l_{k,b}^t + \max_{k=1}^{K} (l_{k,b}^t) - \min_{k=1}^{K} (l_{k,b}^t) \right), \tag{7}$$
where the first term, i.e., $\frac{1}{K} \sum_{k=1}^{K} l_{k,b}^t$, represents the average loss of the global model against the local data of all clients, and the second term, i.e., $\max_{k=1}^{K} (l_{k,b}^t) - \min_{k=1}^{K} (l_{k,b}^t)$, indicates the bias of the global model's performance on the clients' datasets.
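The way an action is turned into impact factors (Eqs. (5) and (6)) and then used for aggregation (Eq. (4)) can be sketched as follows. This is a minimal illustration under several assumptions that the text does not confirm: client models are flattened lists of floats, $\sigma_k$ is treated as a standard deviation, the means are non-negative, and the constraint of Eq. (6) is enforced by clipping. All function names and toy values are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def impact_factors(mu: Sequence[float], sigma: Sequence[float],
                   beta: float, rng: random.Random) -> List[float]:
    """Sample one value per client from N(mu_k, sigma_k), then softmax (Eq. 5).

    Assumption: Eq. (6) is enforced by clipping sigma_k to beta * mu_k,
    and sigma_k is interpreted as a standard deviation.
    """
    samples = []
    for mu_k, sigma_k in zip(mu, sigma):
        clipped = max(0.0, min(sigma_k, beta * mu_k))
        samples.append(rng.gauss(mu_k, clipped))
    return softmax(samples)


def aggregate(client_weights: Sequence[Sequence[float]],
              alpha: Sequence[float]) -> List[float]:
    """w^{t+1} = sum_k alpha_k * w_k (Eq. 4) on flattened weight lists."""
    dim = len(client_weights[0])
    return [sum(a * w[j] for a, w in zip(alpha, client_weights))
            for j in range(dim)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
alpha = impact_factors(mu=[1.0, 2.0, 0.5], sigma=[0.3, 0.1, 0.9],
                       beta=0.2, rng=rng)
new_global = aggregate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], alpha)
```

Because of the softmax, the impact factors are positive and sum to one, so the aggregated model stays a convex combination of the client models regardless of what the policy outputs.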
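The state of Section 3.3.2 and the reward of Section 3.3.4 are both simple functions of the per-client losses, which the sketch below makes concrete. Two points are assumptions rather than statements from the text: the function names are hypothetical, and the sign convention is a guess. The extracted form of Eq. (7) shows no leading sign, but since the text describes both terms as quantities to reduce while an RL agent maximizes its return, the sketch negates the sum by default and exposes a flag to turn that off.

```python
from typing import List, Sequence


def build_state(losses_before: Sequence[float],
                losses_after: Sequence[float],
                sample_counts: Sequence[int]) -> List[float]:
    """Concatenate the 3*K values that make up the state s_t (Section 3.3.2)."""
    if not (len(losses_before) == len(losses_after) == len(sample_counts)):
        raise ValueError("all three inputs need one entry per client")
    return (list(losses_before) + list(losses_after)
            + [float(n) for n in sample_counts])


def loss_summary(losses_before: Sequence[float]) -> float:
    """Right-hand side of Eq. (7): average loss plus the max-min gap."""
    average = sum(losses_before) / len(losses_before)
    gap = max(losses_before) - min(losses_before)
    return average + gap


def reward(losses_before: Sequence[float], negate: bool = True) -> float:
    """Assumption: negate so that lower loss and a smaller gap give a higher reward."""
    value = loss_summary(losses_before)
    return -value if negate else value


state = build_state([0.5, 1.0], [0.2, 0.4], [10, 30])
skewed = reward([0.5, 1.0, 1.5])    # same average loss, large gap
balanced = reward([1.0, 1.0, 1.0])  # same average loss, no gap
```

The two toy loss vectors have the same average, so the difference in reward comes entirely from the max-min term, which is how the reward penalizes a global model that favors some clients.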
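Two building blocks of the basic training process in Algorithm 1 (Section 3.4.1), the temporal-difference priority (lines 1-2) and the soft target update (line 9), can be sketched without any deep learning framework. This is an illustration only: states and actions are reduced to single floats, the value function `toy_q` is a hypothetical stand-in for the value network, and the gradient steps of lines 6-7 are omitted.

```python
from typing import Callable, List, Sequence, Tuple

# (s, a, r, s_next); scalars are used only to keep the sketch short.
Experience = Tuple[float, float, float, float]
QFunction = Callable[[float, float], float]


def td_priority(exp: Experience, q: QFunction, gamma: float) -> float:
    """|r + gamma * Q(s', a) - Q(s, a)|, following line 1 of Algorithm 1."""
    s, a, r, s_next = exp
    return abs(r + gamma * q(s_next, a) - q(s, a))


def sort_by_priority(buffer: Sequence[Experience], q: QFunction,
                     gamma: float) -> List[Experience]:
    """Line 2 of Algorithm 1: order the buffer by descending priority."""
    return sorted(buffer, key=lambda e: td_priority(e, q, gamma), reverse=True)


def soft_update(target: Sequence[float], main: Sequence[float],
                rho: float) -> List[float]:
    """Line 9 as printed: target <- rho * target + (1 - rho) * main."""
    return [rho * t + (1.0 - rho) * m for t, m in zip(target, main)]


def toy_q(s: float, a: float) -> float:
    """Hypothetical value function used only for this example."""
    return s + a


buffer = [(0.0, 1.0, 1.0, 0.0), (0.0, 1.0, 5.0, 0.0)]
ranked = sort_by_priority(buffer, toy_q, gamma=0.99)
new_target = soft_update([0.0], [1.0], rho=0.02)
```

Read literally with the value of 0.02 from Table 1, the printed update moves the target parameters almost all the way to the main parameters in one step. Many DDPG-style implementations weight the two terms the other way round, so the direction used in the authors' code is worth checking before reusing this form.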
Algorithm 1: Basic training process
Input: $\theta$-parameterized main policy network $\pi$, $\phi$-parameterized main value network $Q$, experience buffer $\mathcal{D}$, target networks $\pi'$, $Q'$ as copies of $\pi$ and $Q$.
1 For each experience $(s, a, r, s')$, assign a priority based on the temporal difference: $e.\mathrm{prior} \leftarrow |r + \gamma Q(s', a) - Q(s, a)|$;
2 Sort $\mathcal{D}$ by the descending order of the priority;
3 for $b$ times updating do
4     Sample a batch $B = (s, a, r, s') \sim \mathcal{D}$;
5     Compute targets: $y(r, s') \leftarrow r + \gamma Q'(s', \pi'(s'))$;
6     Update the main value network with gradient descent: $\nabla_\phi \frac{1}{|B|} \sum_{(s, a, r, s') \in B} (Q(s, a) - y(r, s'))^2$;
7     Update the main policy network with gradient ascent: $\nabla_\theta \frac{1}{|B|} \sum_{s \in B} Q(s, \pi(s))$;
8     Update target networks with:
9     $\phi' \leftarrow \rho \phi' + (1 - \rho) \phi$, $\theta' \leftarrow \rho \theta' + (1 - \rho) \theta$;
10 end
Table 1: Configuration of the policy and value networks
Hyper-parameters                        Values
$\pi$-network's #layer                  3
$Q$-network's #layer                    3
Hidden layer size                       256
$\pi$-network learning rate             0.0001
$Q$-network learning rate               0.001
Experience buffer size                  100000
Discount factor $\gamma$                0.99
Soft main-target update factor $\rho$   0.02
3.4 Deep Reinforcement Learning Strategy
With the reinforcement learning terms defined in the previous section, in what follows, we describe the structure of our DRL network as well as the training strategies. We first present the structure of the model and a basic training strategy in Section 3.4.1. To enhance the performance and reduce the training time, we then propose a two-stage training strategy in Section 3.4.2.
3.4.1 Basic training with temporal difference-based experience selection. Inspired by DDPG [15], we design a DRL agent as shown in Fig. 3(a). The agent maintains two types of networks: the value network and the policy network. The former is responsible for estimating the goodness of states, while the latter is for determining actions. Moreover, for each type of network (i.e., value network or policy network), there are two subnets: the main network and the target network. The main network and the target network have the same structure. The difference is that the main network directly interacts with the environment and is updated based on the feedback from the environment. Meanwhile, the target network is a soft copy of the main network; it is more stable and, therefore, is used as a reference point in training the main network. The agent's operation is divided into two threads running in parallel: the main thread and the side thread. The former uses the main policy network to make decisions, i.e., calculating the impact factors of the clients' models, while the latter is responsible for improving the policy network's performance based on the feedback from the environment. Specifically, the main thread works as follows.
• At communication round $t+1$, the agent (i.e., server) receives from each $k$-th client a tuple $p_k^t = \{l_{k,b}^t, l_{k,a}^t, n_k, w_k^t\}$, where $l_{k,b}^t$ and $l_{k,a}^t$ are the losses of the global model at client $k$ and the losses of the local model trained by client $k$ at round $t$, respectively, $n_k$ is the number of samples of client $k$, and $w_k^t$ is the weights of the model trained by client $k$. The agent then aggregates the information received from all clients to establish state $s_{t+1}$.
• State $s_{t+1}$ is fed into the main policy network $\pi$ to derive action $a_{t+1}$. As mentioned in Section 3.3.3, action $a_{t+1}$ is a vector containing the means and variances of $K$ Gaussian distributions. The agent then calculates each client's impact factor and aggregates the clients' models.
• The agent calculates reward $r_t$ using formula (7) and stores a tuple of $(a_t, s_t, r_t, s_{t+1})$ into the experience buffer.
In parallel with the main thread, the agent also runs the side thread continuously to improve the policy network's performance. In the following, we describe the details of our basic training process used in the side thread. The pseudo-code of our basic training method is shown in Algorithm 1.
• Firstly, the agent extracts a set of experiences from the experience buffer. Each experience is a tuple of $(a_t, s_t, r_t, s_{t+1})$. Here, we apply the temporal difference-prioritizing strategy to prioritize the experiences. Specifically, each experience $e_t = (a_t, s_t, r_t, s_{t+1})$ is assigned a priority $\delta_t = r_t + \gamma Q(s_{t+1}, a_t) - Q(s_t, a_t)$. The higher the priority, the more likely the experience is to be chosen (Lines 1-2 in Algorithm 1).
• Using the selected experiences, the agent updates the main value network using the gradient descent method, and updates the main policy network using the gradient ascent method (Lines 4-7 in Algorithm 1).
• Finally, updates of the main network are transferred to the target network using the $\rho$-soft update mechanism (Lines 8-9 in Algorithm 1).
Figure 3(c) illustrates the structure of the policy and value networks. Specifically, we construct our policy networks with 3 fully connected layers, each containing 256 nodes, paired with the Leaky ReLU activation function. The output of a policy network is a flattened vector of size $K \times 2$ representing the mean and standard deviation of the multivariate Gaussian distribution from which we sample the impact factors. For the value networks, we use 2 hidden layers of size 256, the outputs of which are also activated by a Leaky ReLU function. The full configuration is presented in Table 1.
3.4.2 Two-stage training strategy. In the basic training strategy mentioned above, each communication round provides only one experience. As a consequence, the number of experiences utilized to train the policy and value networks is relatively small. This constraint limits the performance of the networks. To this end, we propose a novel two-stage training technique as shown in Fig. 3(b). In the first stage, named the online training phase, we create $m$ identical agents (which we name workers). These agents simultaneously interact with the environment to train the networks and generate experiences. It is worth noting that although the workers are initially identical, they will evolve into
distinct individuals over the training process, and hence their collected experiences will differ. The workers' experiences are then merged to build an experience buffer, which is utilized to train a main agent in the second stage. The main agent is trained in an offline manner without interaction with the environment. Specifically, we employ the experiences from the experience buffer created in the first stage to update the main agent's policy and value networks using the gradient ascent and gradient descent methods, respectively. With such a training strategy, we dramatically enrich the training data while decreasing the training time. The trained main agent is then used to make the decisions.
We summarize the total flow of FedDRL in Algorithm 2.
Algorithm 2: Overall flow of FedDRL
Input: A server with a $w$-parameterized DNN, $N$ clients with separated datasets, maximum communication round $T$, $K$ participating clients at each round.
1 Server: Initiate $\theta$-parameterized policy network $\pi$, $\phi$-parameterized value network $Q$, experience buffer $\mathcal{D}$, target networks $\pi'$, $Q'$ as copies of $\pi$ and $Q$;
2 Server: Initiate global model $w^0$;
3 for communication round $t$ from 0 to $T$ do
4     Server sends $K$ copies of $w^t$ to the participating clients;
5     for each client $k$ in $K$ clients in parallel do
6         $w_k^{(t,0)} \leftarrow w^t$
7         $l_{k,b}^t \leftarrow$ Inference loss with $w_k^{(t,0)}$
8         Local training in $E$ epochs with batch $b$:
9         $w_k^t \leftarrow w^t - \eta \cdot \frac{1}{b} \sum_{e=1}^{E} \sum_{i=1}^{n_k / b} \nabla l_i(w_k^{(t,i)})$
10        $l_{k,a}^t \leftarrow$ Inference loss with $w_k^t$
11        Send $p_k^t \leftarrow (l_{k,b}^t, l_{k,a}^t, n_k, w_k^t)$ to the server;
12    end
13    Server unpacks $p_k^t$ from clients, concatenates to form state $s_{t+1}$;
14    Server selects action: $(\vec{\mu}, \vec{\sigma}) \leftarrow \pi(s) + \epsilon$, $\epsilon \sim \mathcal{N}$;
15    Server computes impact factor vector: $\vec{\alpha} \leftarrow \mathrm{softmax}(\mathcal{N}(\vec{\mu}, \vec{\sigma}))$;
16    Server performs weighted aggregation upon concatenated parameters-matrix from clients: $\mathbf{w}^{t+1} = \mathbf{W}^t \cdot \vec{\alpha}^t = \sum_{k \in \mathcal{K}} \alpha_k^t \cdot w_k^t$;
17    Server observes the next state $s'$, computes reward $r$;
18    Server stores $(s, a, r, s')$ into $\mathcal{D}$;
19    if $\mathcal{D}$ is sufficient then
20        Server performs procedure in Algorithm 1;
21    end
22 end
3.5 Assumptions and Limitations
The assumptions and limitations of FedDRL are as follows:
Targeted communication strategies: Our method is designed for synchronous FL, where the server waits for all the clients to finish their local training (with a fixed number of training epochs³) before starting to compute the new global model. The technique in this work may not be suitable for supporting asynchronous FL [2], where the global weight is updated right after receiving the locally trained model from one (or some) of the $K$ clients involved in each round. Note, however, that our technique is still applicable to other communication techniques such as using sparse data compression [4, 18] or a hierarchical architecture [28].
Overhead of FedDRL: Although FedDRL requires some extra computation and communication overhead, it is still practical in real-world applications. The computation overhead is the inference latency at the end of each training round, which is trivial since the size of the clients' local datasets is usually small. It is reported that such overhead is less than one second on MNIST [29]. It is worth noting that performing the aggregation at the server also requires a trivial overhead for calculating the impact factors, i.e., the inference phase through the policy network of the DRL module. We further estimate this overhead in Section 5.3. For the communication overhead, FedDRL only needs some extra floating-point numbers for the inference losses in comparison with FedAvg.
4 EXPERIMENTS
4.1 Evaluation Methodology
This section describes a wide range of experiments to compare the proposed FedDRL against two baselines, FedAvg [17] and FedProx [12]. We evaluate the classification performance using the best top-1 accuracy on the test sets of the targeted datasets. We also evaluate SingleSet as a reference, i.e., training on all the data samples of all the clients on a single machine (e.g., training at the server or in the system of only one client)⁴.
4.1.1 Federated Datasets. We evaluate the accuracy of a deep neural network (DNN) model trained on a non-IID partitioned dataset. In this work, we pick datasets and DNN models that are curated from prior work in federated learning. In detail, we use three different federated datasets, i.e., CIFAR-100 [9], Fashion-MNIST [26], and MNIST [10]. To study the efficiency of different federated methods, it is necessary to specify how to distribute the data to the clients and simulate data heterogeneity scenarios. We study three ways to partition the targeted datasets by considering the label size imbalance, quantity skew, and cluster skew (more detail in Section 2). Although our main target in this evaluation is the dataset distributed with the cluster-skew non-IID, we also estimate the performance of our proposal in a conventional setting, e.g., Pareto. We distribute the training samples among $N$ clients, i.e., 10 and 100 clients, as follows:
• Pareto (denoted as PA): non-IID partition method where the number of samples of a label among clients follows a power law [12, 13]. We assign 2 labels to each client for the MNIST dataset (20 labels/client for CIFAR-100).
• Clustered-Equal (denoted as CE): We consider the simple case of cluster skew. We first arrange the clients into groups, with one group that has a significantly higher number of clients than the others, e.g., the main group with $\delta \times N$ clients. A higher $\delta$ means more bias toward the main group. We also fix the number of labels in each client to two. The number of samples per client does not change among clients.
• Clustered-Non-Equal (denoted as CN): Similar to CE, but the number of samples per client is unbalanced.
³ Unless otherwise stated, we fix the number of epochs of all the clients to 5.
⁴ SingleSet can be the best when the total number of samples across all the clients is big enough. In this case, the training with SingleSet becomes similar to the training on an IID dataset.
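To make the Clustered-Equal (CE) description in Section 4.1.1 more concrete, the sketch below shows one possible way to build such a split. It is not the authors' partitioning code, and every detail beyond the text is an assumption: the function name, the rule that clients outside the main group cycle over the remaining groups, the way label pairs are attached to groups, and the fact that clients of one group may draw overlapping samples.

```python
import random
from typing import Dict, List, Sequence, Tuple


def clustered_equal_partition(labels: Sequence[int], num_clients: int,
                              delta: float,
                              group_labels: Sequence[Tuple[int, int]],
                              samples_per_client: int,
                              seed: int = 0) -> Dict[int, List[int]]:
    """Return {client_id: [sample indices]} for a CE-style split.

    Assumption: group_labels[0] is the label pair of the main group, and
    clients within a group may share samples (a simplification).
    """
    if len(group_labels) < 2:
        raise ValueError("need a main group and at least one other group")
    rng = random.Random(seed)

    # Index the dataset by label.
    by_label: Dict[int, List[int]] = {}
    for idx, y in enumerate(labels):
        by_label.setdefault(y, []).append(idx)

    # The main group takes delta * N clients; the rest cycle over other groups.
    main_size = round(delta * num_clients)
    group_of = [0] * main_size
    for i in range(num_clients - main_size):
        group_of.append(1 + i % (len(group_labels) - 1))

    partition: Dict[int, List[int]] = {}
    for client, group in enumerate(group_of):
        pool = [i for y in group_labels[group] for i in by_label.get(y, [])]
        # Two labels per client and the same sample count for every client.
        partition[client] = rng.sample(pool, samples_per_client)
    return partition


# Hypothetical toy dataset: 1000 samples with labels 0..9 in round-robin order.
toy_labels = [i % 10 for i in range(1000)]
parts = clustered_equal_partition(toy_labels, num_clients=10, delta=0.6,
                                  group_labels=[(0, 1), (2, 3), (4, 5)],
                                  samples_per_client=20)
```

With $\delta = 0.6$ and 10 clients, six clients end up holding only the main group's two labels. A CN-style variant would differ only in drawing a different number of samples for each client.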
Table 2: Characteristics of the non-IID partitioning methods. ✓: related non-IID type; ×: not related. Explanation in remarks.
Partition Method   Clustered Skew   Label Size Imbalance   Quantity Imbalance   Remarks
PA                 ×                ✓                      ✓                    #samples follows a power law [13]
CE                 ✓                ✓                      ×                    Our proposed method
CN                 ✓                ✓                      ✓                    Our proposed method
Table 3: Comparison of top-1 test accuracy to baseline approaches with different partitioning methods, i.e., PA, CE and CN. The values show the best accuracy that each FL method reaches during training. The best performance results are highlighted in the bold, underlined font. impr.(a) and impr.(b) are the relative accuracy improvement of FedDRL compared with the best and the worst baseline FL method (in percentage), respectively. Significant improvements, i.e., > 1%, are also marked in the bold, underlined font.
                        CIFAR-100                 Fashion-MNIST             MNIST
Clients   Method        PA      CE      CN        PA      CE      CN        PA      CE      CN
10        SingleSet     75.63   68.6    72.67     90.06   88.9    88.48     97.65   98.87   99.01
10        FedAvg        69.81   62.18   64.6      86.19   82.49   82.97     89.58   98.13   97.98
10        FedProx       71.13   62.25   66.8      86.54   82.52   82.43     97.23   96.89   96.88
10        FedDRL        72.63   64.51   67.32     86.92   82.59   84.26     97.31   98.13   98.08
10        impr.(a)      2.11%   3.63%   0.78%     0.44%   0.08%   1.55%     0.08%   0.00%   0.10%
10        impr.(b)      4.18%   3.75%   4.21%     0.85%   0.12%   2.22%     8.63%   1.28%   1.24%
100       SingleSet     64.55   71.26   68.73     88.9    88.60   88.35     99.19   98.79   98.82
100       FedAvg        58.29   60.95   58.17     85.57   84.88   83.82     98.43   97.9    98.22
100       FedProx       56.03   55.55   60.53     86.36   84.78   83.89     98.20   97.95   98.13
100       FedDRL        58.41   61.12   61.51     86.65   86.71   86.69     98.63   98.03   98.27
100       impr.(a)      0.21%   0.28%   1.62%     0.34%   2.16%   3.34%     0.20%   0.08%   0.05%
100       impr.(b)      4.25%   10.03%  5.74%     1.26%   2.28%   3.42%     0.44%   0.13%   0.14%
Figure 4: Illustration of data partitioning methods with 10 clients. The size of the circle shows the number of samples/client. [Panels: (a) PA, (b) CE, (c) CN; x-axis: Client (0 to 9); y-axis: Label (L0 to L9).]
Figure 5: Top-1 test accuracy of different FL methods vs. communication round. The results of Fashion-MNIST are plotted with average smoothing over every 10 communication rounds for better visualization. We omit the result of MNIST due to the space limitation.
Figure 6: Average (top row) and variance (bottom row) of inference loss among all clients, normalized to those of FedDRL (red line) in the case of the CIFAR-100 dataset, 10 clients.
4.2.2 Robustness to the client datasets. Testing accuracy on a test dataset that is usually independent and identical helps estimate the goodness of a Deep Learning (DL) model over the global distribution. However, it is reported that the average accuracy may be high while the accuracy of each device may not always be high [8] when the server overly favors some clients during its weight aggregation (due to the bias of the non-IID dataset). To estimate the robustness of an FL method across clients, we test the global model on the local sub-datasets of all the participating clients at the beginning of each communication round, i.e., we do the inference pass at the clients. Figure 6 shows the average (in the top row) and variance (in the bottom row) of the inference loss among all the clients. Values are normalized to those of FedDRL (the red line). A value of an FL method that lies under the red line means that the method achieves a lower average inference loss (or variance) than FedDRL, and vice versa.
4.3 Sensitivity Analysis
4.3.1 Impact of the number of participating clients. We next conduct a sensitivity study to quantify the impact of the client participation level on accuracy. As shown in Figure 7, we change the number of participating clients $K$ in one communication round from 10 up to 50 out of the total $N = 100$ clients when training on the CIFAR-100 dataset. We observe that varying the number of participating clients affects the convergence rate but does not impact the eventual accuracy when the training converges. As a result, the improvement in accuracy of FedDRL over the other two baseline methods is consistently maintained.
4.3.2 Impact of the non-IID level. It is well known that the convergence behaviors of FL methods are sensitive to the degree of non-IID. Our target in this work is to study the effect of cluster skew on testing accuracy. We thus change the number of clients belonging to the main group ($\delta$) to change the level of bias caused by the clients from the main group. Figure 8 shows how the testing accuracy changes when $\delta$ varies from 0.2 to 0.6 in the case of the Fashion-MNIST dataset. First, increasing the non-IID level negatively affects the testing accuracy for all the FL methods. Second,
Figure 7: Testing accuracy vs. the number of participating Figure 8: Testing accuracy vs. the non- Figure 9: Average server
clients 𝐾 (CIFAR-100, 𝑁 = 100 clients). IID level (Fashion-MNIST, 100 clients). computation time.
Table 4: Top-1 test accuracy with the label size imbalanced non-IID partitions suggested by [17] on the CIFAR-100 dataset.

# clients | Equal                                | Non-equal
          | SingleSet  FedAvg  FedProx  FedDRL   | SingleSet  FedAvg  FedProx  FedDRL
10        | 81.16      75.52   70.18    76.65    | 81.56      75.5    76.65    76.9
100       | 81.34      72.71   72.52    73.47    | 81.59      72.44   72.89    73.42

the results also indicate that FedDRL mitigates the negative impact of these biases. For instance, FedDRL is slightly better than the baselines in most cases. In particular, when 𝛿 = 0.6, FedDRL improves significantly on the accuracy of FedAvg and FedProx, e.g., by approximately 2-3%.

5 DISCUSSION

5.1 Impact of the non-IID type
In this section, we illustrate that our approach does not rely on any particular type of non-IID data. It is applicable not just to cluster-skew but to label-skew in general. Specifically, we consider two common label size imbalanced non-IID partitions suggested by [17], as follows:
• Equal: non-IID partition method provided by FedAvg [17]. The dataset is sorted and separated into 2𝑁 shards, where 𝑁 is the number of clients. Each client has samples of two shards, so that the total number of samples per client does not change among clients.
• Non-equal: non-IID partition method provided by FedAvg [17]. The dataset is sorted and separated into 10𝑁 shards. Each client has samples of a random number of shards ranging from 6 to 14.
In Section 4, we showed numerical data regarding the Pareto distribution, in which FedDRL demonstrated a significant increase in accuracy compared to other benchmarks. A similar trend was observed on the equal and non-equal data distributions. FedDRL, for example, outperforms the best baseline FL algorithms by 1.85% to 2.13% on the CIFAR-100 dataset (as shown in Table 4).

5.2 Convergence rate
Figure 10 presents the comparison of the number of communication rounds it takes for each FL method to achieve a target testing accuracy. In this evaluation, we select the target testing accuracy as the minimum accuracy over FedDRL and the two baselines (as shown in Table 3). For example, to reach an accuracy of 62.1% for the CIFAR-100 dataset with the CE partitioning method in a network of 10 clients, FedDRL requires 80 communication rounds. FedAvg and FedProx spend 1.16× and 1.2× longer than FedDRL, respectively. Overall, the convergence rate of FedDRL is not always the fastest. It is interesting to emphasize that the convergence rates of FedAvg and FedProx do not remain consistent across different data partitioning methods. In contrast, our method FedDRL converges at a rate comparable to the fastest method in all the cases. The results on Fashion-MNIST and MNIST show a similar trend. This result shows that FedDRL is much more consistent and robust than the baseline methods. This is due to the adaptive priorities that the proposed method assigns to clients based on their local data.

5.3 Computational overhead of FedDRL
As mentioned, our FedDRL is designed for synchronous federated learning, where significant computation time on the server would prolong the synchronization latency. To ensure that our weighted aggregation method using DRL does not contribute much computation overhead at the server versus other baseline FL methods, we estimate the time to calculate the impact factors (DRL in Figure 9). We also present the time to perform the actual weights aggregation for reference (Aggregation). The result shows that the computation time of the DRL module is trivial, e.g., only 3 milliseconds. Such overhead is consistent across all the models and datasets. By contrast, the aggregation time of FedDRL depends on the size of the Deep Learning models, e.g., 45 and 3 milliseconds for VGG-11 and CNN, respectively. In short, our FedDRL is practical in terms of computation time at the server.

6 CONCLUSION
Despite the popularity of the standard federated learning (FL) method FedAvg [17] and its alternatives, there are some critical challenges when applying them to real-world non-IID data. Most existing methods conduct their studies on very restricted non-IID data, e.g., label-skew and quantity-skew non-IID. In this work, we tackled a novel non-IID type observed in real-world datasets, where groups of clients exhibit a strong correlation and have similar datasets. We named it the cluster-skew non-IID. To deal with this kind of non-IID data, we then proposed a novel aggregation method for
ICPP ’22, August 29-September 1, 2022, Bordeaux, France
Figure 10: Convergence rate of different FL methods on CIFAR-100, Fashion-MNIST, and MNIST datasets.
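The rounds-to-target comparison that Figure 10 reports (Section 5.2) can be illustrated with a minimal Python sketch. It assumes that the "minimum accuracy" target means the lowest of the best accuracies reached by the compared methods. The function names and the toy accuracy curves are hypothetical and are not results from the paper.

```python
from typing import Dict, List, Optional


def rounds_to_target(history: List[float], target: float) -> Optional[int]:
    """Return the 1-based round at which accuracy first reaches the target."""
    for round_idx, acc in enumerate(history, start=1):
        if acc >= target:
            return round_idx
    return None  # the target was never reached


def compare_convergence(histories: Dict[str, List[float]]) -> Dict[str, Optional[int]]:
    """Count the rounds each method needs to reach a shared target accuracy.

    Assumption: the target is the lowest best-accuracy among the methods,
    so every method is guaranteed to reach it at some round.
    """
    target = min(max(history) for history in histories.values())
    return {name: rounds_to_target(h, target) for name, h in histories.items()}


# Toy per-round test accuracies (illustrative values only).
toy_histories = {
    "FedAvg": [0.30, 0.45, 0.55, 0.60, 0.62],
    "FedProx": [0.28, 0.44, 0.56, 0.61, 0.63],
    "FedDRL": [0.35, 0.52, 0.62, 0.64, 0.65],
}
rounds = compare_convergence(toy_histories)

# Slowdown of each method relative to FedDRL, analogous to the
# "1.16x and 1.2x longer" style of comparison used in the text.
slowdown = {name: r / rounds["FedDRL"] for name, r in rounds.items()}
```

Choosing the target as the minimum over methods keeps the comparison fair: no method is penalized for failing to reach an accuracy that only one competitor attains.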
Federated Learning, namely FedDRL, which exploits a Deep Reinforcement Learning approach. We performed intensive experiments on three common datasets, i.e., CIFAR-100, Fashion-MNIST, and MNIST, with three different ways to partition the datasets. The experimental results showed that the proposed FedDRL achieves better accuracy than FedAvg [17] and FedProx [12], e.g., by up to 10% in absolute accuracy.

ACKNOWLEDGEMENTS
This work was funded by Vingroup Joint Stock Company (Vingroup JSC), Vingroup, and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2021.DA00128.

REFERENCES
[1] Tal Ben-Nun and Torsten Hoefler. 2018. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. CoRR abs/1802.09941 (2018). arXiv:1802.09941
[2] Zheng Chai, Yujing Chen, Ali Anwar, Liang Zhao, Yue Cheng, and Huzefa Rangwala. 2021. FedAT: A High-Performance and Communication-Efficient Federated Learning System with Asynchronous Tiers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’21). Article 60, 16 pages.
[3] Yae Jee Cho, Jianyu Wang, and Gauri Joshi. 2020. Client selection in federated learning: Convergence analysis and power-of-choice selection strategies. arXiv preprint arXiv:2010.01243 (2020).
[4] Pengchao Han, Shiqiang Wang, and Kin K. Leung. 2020. Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach. In 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). 300–310.
[5] Howlader N, Noone AM, Krapcho M, et al. (eds), National Cancer Institute. 2021. SEER Cancer Statistics Review 1975-2018. https://seer.cancer.gov/csr/1975_2018/browse_csr.php. [19 April 2021].
[6] Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. 2020. The Non-IID Data Quagmire of Decentralized Machine Learning. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119). PMLR, 4387–4398.
[7] Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. 2020. The non-iid data quagmire of decentralized machine learning. In International Conference on Machine Learning. PMLR, 4387–4398.
[8] Wei Huang, Tianrui Li, Dexian Wang, Shengdong Du, and Junbo Zhang. 2020. Fairness and accuracy in federated learning. arXiv preprint arXiv:2012.10069 (2020).
[9] A. Krizhevsky and G. Hinton. 2009. Learning multiple layers of features from tiny images. Master’s thesis, Dep. of Comp. Sci. Univ. of Toronto (2009).
[10] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791
[11] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. 2020. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine 37, 3 (2020), 50–60.
[12] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2 (2020), 429–450.
[13] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. 2020. On the Convergence of FedAvg on Non-IID Data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
[14] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. 2021. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In International Conference on Learning Representations.
[15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[16] Jiahuan Luo, Xueyang Wu, Yun Luo, Anbu Huang, Yunfeng Huang, Yang Liu, and Qiang Yang. 2019. Real-world image datasets for federated learning. arXiv preprint arXiv:1910.11089 (2019).
[17] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282.
[18] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. 2020. Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data. IEEE Transactions on Neural Networks and Learning Systems 31, 9 (2020), 3400–3413. https://doi.org/10.1109/TNNLS.2019.2944481
[19] Karen Simonyan et al. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR 2015.
[20] Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
[21] Hao Wang, Zakhary Kaplan, Di Niu, and Baochun Li. 2020. Optimizing Federated Learning on Non-IID Data with Reinforcement Learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. 1698–1707. https://doi.org/10.1109/INFOCOM41043.2020.9155494
[22] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. 2021. A Novel Framework for the Analysis and Design of Heterogeneous Federated Learning. IEEE Transactions on Signal Processing 69 (2021), 5234–5249. https://doi.org/10.1109/TSP.2021.3106104
[23] Zheng Wang, Xiaoliang Fan, Jianzhong Qi, Chenglu Wen, Cheng Wang, and Rongshan Yu. 2021. Federated Learning with Fair Averaging. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Zhi-Hua Zhou (Ed.). International Joint Conferences on Artificial Intelligence Organization, 1615–1623. Main Track.
[24] Zibo Wang, Yifei Zhu, Dan Wang, and Zhu Han. 2021. FedACS: Federated Skewness Analytics in Heterogeneous Decentralized Data Environments. In 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). 1–10.
[25] Hongda Wu and Ping Wang. 2021. Fast-Convergent Federated Learning With Adaptive Weighting. IEEE Transactions on Cognitive Communications and Networking 7, 4 (2021), 1078–1088. https://doi.org/10.1109/TCCN.2021.3084406
[26] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
[27] Peng Xiao, Samuel Cheng, Vladimir Stankovic, and Dejan Vukobratovic. 2020. Averaging is probably not the optimum way of aggregating parameters in federated learning. Entropy 22, 3 (2020), 314.
[28] He Yang. 2021. H-FL: A Hierarchical Communication-Efficient and Privacy-Protected Architecture for Federated Learning. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. 479–485. Main Track.
[29] Hui Zeng, Tongqing Zhou, Yeting Guo, Zhiping Cai, and Fang Liu. 2021. FedCav: Contribution-Aware Model Aggregation on Distributed Heterogeneous Data in Federated Learning. Association for Computing Machinery, New York, NY, USA.
[30] Daniel Yue Zhang, Ziyi Kou, and Dong Wang. 2021. FedSens: A Federated Learning Approach for Smart Health Sensing with Class Imbalance in Resource Constrained Edge Computing. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. 1–10.
[31] Hangyu Zhu, Jinjin Xu, Shiqing Liu, and Yaochu Jin. 2021. Federated Learning on Non-IID Data: A Survey. arXiv preprint arXiv:2106.06843 (2021).