

Data Fusion – KEN4223

Lecture 1
Data Fusion course - Team
• Anna Wilbik, professor in data fusion and intelligent interaction
• Marcin Pietrasik, post-doc in data fusion
• Afsana Khan, PhD student in federated learning
Who are you?

(https://www.flickr.com/photos/saulalbert/37545736336)
Data fusion
• A process dealing with the association, correlation, and combination of data and
information from single and multiple sources to achieve refined position and identity
estimates, and complete and timely assessments of situations and threats as well as
their significance
• Data fusion is a formal framework in which are expressed means and tools for the
alliance of data originating from different sources. It aims at obtaining information of
greater quality; the exact definition of ‘greater quality’ will depend upon the
application
• Data Fusion is the analysis of several data sets such that different data sets can
interact and inform each other
• DF is a framework, fit by an ensemble of tools, for the joint analysis of data from
multiple sources (modalities) that allows achieving information/knowledge not
recoverable by the individual ones.
Taxonomy of data fusion
Federated learning as model fusion

https://doi.org/10.1016/B978-0-444-63984-4.00001-6
https://towardsdatascience.com/introduction-to-ibm-federated-learning-a-collaborative-approach-to-train-ml-models-on-private-data-2b4221c3839
Schedule (1)
Date Topic
06.02.2024 08:30-10:30 Lecture: Introduction. Federated learning (1)
08.02.2024 08:30-10:30 Lecture: Federated learning (2)
20.02.2024 08:30-10:30 Lab: Federated learning
21.02.2024 11:00-13:00 Lecture: High level fusion
22.02.2024 11:00-13:00 Guest Lecture: Industry perspective
28.02.2024 11:00-13:00 Lab: High level fusion
29.02.2024 11:00-13:00 Lecture: Mid-level fusion (1)
06.03.2024 11:00-13:00 Lecture: Mid-level fusion (2)
07.03.2024 11:00-13:00 Lab: Mid-level fusion
12.03.2024 09:00-10:30* Q&A
Schedule (2)

Date Topic
13.03.2024 11:00-13:00 Lecture: Low level fusion
14.03.2024 11:00-13:00 Lab: Low level fusion
20.03.2024 11:00-13:00 Lecture: Outcome Economy
21.03.2024 11:00-13:00 Lab: Outcome Economy
26.03.2024 09:00-10:30* Exam preparation (Q&A)
27.03.2024 11:00-13:00 Assignment presentations
28.03.2024 13:00 Assignment deadline
Materials
• Handouts from the lectures
• Scientific articles – please check Canvas
Grading
• Assignment (A) (0-10) 30% weight
• Written exam (B) (0-10) 70% weight
• Bonus (C) (0.25)
Final grade = MAX(1,MIN(10,ROUND(0.3*A+0.7*B+C)))
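A quick sanity check of this formula in code (a minimal sketch; the grades used are made up):

```python
def final_grade(a, b, c=0.0):
    """Final grade = MAX(1, MIN(10, ROUND(0.3*A + 0.7*B + C)))."""
    return max(1, min(10, round(0.3 * a + 0.7 * b + c)))

print(final_grade(7.5, 6.8, 0.25))  # hypothetical grades -> 7
```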

Assignment:
• in groups – use different data fusion techniques on a data set that is provided
• Presentation presence – compulsory (penalty for absent group member)
Resit:
• Written exam - Regular resit in the ongoing academic year
• Assignment – Special repair project for groups with initial score between 4 and 6.
Repair score is 6.
Communication
• Lectures
• Instructions
• Canvas forum
• Please don’t send us emails
about the course content – use Canvas!

(https://universalscribbler.wordpress.com/2012/07/16/work-dealing-with-email-overflow/)
Federated learning
Taxonomy of data fusion
Federated learning as model fusion

https://doi.org/10.1016/B978-0-444-63984-4.00001-6
https://towardsdatascience.com/introduction-to-ibm-federated-learning-a-collaborative-approach-to-train-ml-models-on-private-data-2b4221c3839
From centralized to decentralized data


Possible options

• Collect data centrally, anyway


• Use only local data
• A new solution?
Why can’t we just centralize the data?

• Sending the data may be too costly


- Self-driving cars are expected to generate several TBs of data a day
- Some wireless devices have limited bandwidth/power
• Data may be considered too sensitive
- Growing public awareness and regulations on data privacy
- Competitive advantage in business and research
How about each party learning on its own?

• The local dataset may be too small


- Sub-par predictive performance (e.g., due to overfitting)
- Non-statistically significant results (e.g., medical studies)

• The local dataset may be biased


- Not representative of the target distribution
Possible options

• Collect data centrally, anyway


• Use only local data
• A new solution? - Federated learning
Federated learning

“We advocate an alternative that leaves the training data


distributed on the mobile devices, and learns a shared model by
aggregating locally-computed updates. We term this decentralized
approach Federated Learning.”

McMahan et al., Communication-Efficient Learning of Deep Networks from Decentralized Data, 2016.
Federated learning

“Federated learning is a machine learning setting where multiple


entities (clients) collaborate in solving a machine learning problem,
under the coordination of a central server or service provider. Each
client’s raw data is stored locally and not exchanged or transferred;
instead focused updates intended for immediate aggregation are
used to achieve the learning objective.”

Kairouz et al., Advances and open problems in federated learning, 2019.


Federated learning

“collaborative learning without exchanging users’ original data”

Li et al., A survey on federated learning systems: vision, hype and reality for data privacy
and protection, 2019.
Key differences with distributed learning
Data distribution
• In distributed learning, data is centrally stored (e.g., in a data
center)
- The main goal is just to train faster
- We control how data is distributed across workers: usually, it is
distributed uniformly at random across workers
• In FL, data is naturally distributed and generated locally
- Data is not independent and identically distributed (non-i.i.d.), and it is
imbalanced
FL – area under development
[Chart: Web of Science record count of federated learning publications per year, 2016–2024 (y-axis 0–3,000), showing rapid growth]
Gboard: next-word prediction
• Federated RNN (compared to prior n-gram model):
• Better next-word prediction accuracy: +24%
• More useful prediction strip: +10% more clicks

Hard et al., Federated Learning for Mobile Keyboard Prediction, arXiv:1811.03604

https://medcitynews.com/2020/05/upenn-intel-partner-to-use-federated-learning-ai-for-early-brain-tumor-detection/
https://www.technologyreview.com/2019/12/11/131629/apple-ai-personalizes-siri-federated-learning/
https://blogs.nvidia.com/blog/2020/04/15/federated-learning-mammogram-assessment/
Taxonomy of Federated Learning
Federated learning systems

• Data partitioning: horizontal, vertical, hybrid
• Machine learning model: linear models, neural networks, …
• Privacy mechanism: differential privacy, cryptographic methods
• Communication architecture: centralized, decentralized
• Scale of federation: cross-silo, cross-device
• Motivation for federation: incentive, regulation
Li et al., A survey on federated learning systems: vision, hype and reality for data privacy and protection,
arXiv preprint arXiv:1907.09693, 2019.
Data partitioning
[Diagram: Horizontal FL – parties A and B hold different samples described by the same features and labels. Vertical FL – parties A and B hold different features for the same samples, with the labels on one side.]
Horizontal FL
Vertical FL
Hybrid FL
Communication architecture
• Server-orchestrated FL
• Fully decentralized FL
Scale of federation
Cross-silo federated learning:
• Training a model on siloed data; clients are different organizations (e.g. medical or financial) or geo-distributed datacenters
• All clients are almost always available
• Typically 2–100 clients
• Relatively few failures
• Partition is fixed, horizontal or vertical

Cross-device federated learning:
• Clients are a very large number of mobile or IoT devices
• Only a fraction of clients are available at any one time, often with diurnal or other variations
• Massively parallel, up to 10^10 clients
• Highly unreliable: 5% or more of the clients participating in a round of computation are expected to fail or drop out
• Fixed horizontal partition
Motivation for federation
• Incentive
- Obtaining a better model
- Compensation for sharing data
- …
• Regulation
The Lifecycle of a Model in Federated Learning
1. Problem identification
2. Client instrumentation
3. Simulation prototyping (optional)
4. Federated model training
5. (Federated) model evaluation
6. Deployment
Horizontal federated learning
[Diagram: parties A and B hold different samples described by the same features and labels]
How does it work?
1. Client selection
2. Broadcast
3. Client computation
4. Aggregation
5. Model update
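To make the five steps concrete, here is a minimal NumPy sketch of one server-orchestrated, FedAvg-style round for a linear model. The names (make_client, client_update), the synthetic data and the hyperparameters are illustrative assumptions, not part of any specific FL framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic horizontal partition: each client holds its own (X, y) with the same features
def make_client(n_samples, n_features=5):
    X = rng.normal(size=(n_samples, n_features))
    true_theta = np.arange(1, n_features + 1, dtype=float)
    y = X @ true_theta + 0.1 * rng.normal(size=n_samples)
    return X, y

clients = [make_client(n) for n in (50, 80, 120)]  # three clients, unequal data sizes

def client_update(theta, X, y, lr=0.05, local_steps=5):
    """3. Client computation: a few local gradient steps on the client's own data."""
    theta = theta.copy()
    for _ in range(local_steps):
        grad = X.T @ (X @ theta - y) / len(y)      # gradient of the mean squared error
        theta -= lr * grad
    return theta

theta = np.zeros(5)                                # global model
for rnd in range(20):
    # 1. Client selection (here: all clients; in practice a random subset)
    selected = clients
    # 2. Broadcast the current global model; 3. clients compute updates locally
    local_models = [client_update(theta, X, y) for X, y in selected]
    # 4. Aggregation: average weighted by local dataset size n_k / n
    sizes = np.array([len(y) for _, y in selected], dtype=float)
    weights = sizes / sizes.sum()
    # 5. Model update: the weighted average becomes the new global model
    theta = sum(w * m for w, m in zip(weights, local_models))

print(theta)  # should approach [1, 2, 3, 4, 5]
```

Note that only model parameters travel between server and clients in each round; the raw data never leaves the client.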
Federated learning - objective
Goal:
$$\min_{\theta} \sum_{k=1}^{K} p_k \sum_{i=1}^{n_k} \mathcal{L}\left( f\big(x_k^{(i)}, \theta\big),\, y_k^{(i)} \right)$$
where:
$p_k$ – weight of party $k$
$\mathcal{L}(\cdot)$ – loss function
Gradient descent – recap
Derivative: the slope of the tangent line.
Partial derivative: for a multivariate function, the derivative with respect to one variable, with the other variables held fixed.
Gradient vector: the vector whose coordinates are the partial derivatives of the function.
Gradient Descent Algorithm
• Idea
- Start somewhere
- Take steps in the direction opposite to the gradient vector at the current position, until convergence
• Convergence
- Change between two steps < ε
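A minimal sketch of the algorithm on a simple two-variable quadratic (the function, names and step size are illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, eps=1e-6, max_iter=10_000):
    """Start somewhere, step against the gradient until the change is smaller than eps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = lr * grad(theta)
        theta = theta - step
        if np.linalg.norm(step) < eps:   # convergence: change between two steps < eps
            break
    return theta

# Example: f(x, y) = (x - 3)^2 + 2*(y + 1)^2, so the gradient is (2(x - 3), 4(y + 1))
grad_f = lambda t: np.array([2 * (t[0] - 3), 4 * (t[1] + 1)])
print(gradient_descent(grad_f, [0.0, 0.0]))  # approaches [3, -1]
```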
Stochastic Gradient Descent (SGD)
• At each step of gradient descent, instead of computing the gradient over all training samples, randomly pick a small subset (mini-batch) of training samples $(x_k, y_k)$:
$$w_{t+1} \leftarrow w_t - \eta \, \nabla f(w_t; x_k, y_k)$$
• Compared to gradient descent, SGD takes more steps to converge, but each step is much faster.
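A small sketch of the mini-batch variant on a synthetic least-squares problem (illustrative, not tied to any example from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.05 * rng.normal(size=1000)

w, lr, batch_size = np.zeros(3), 0.1, 32
for step in range(500):
    idx = rng.choice(len(y), size=batch_size, replace=False)  # random mini-batch (x_k, y_k)
    xb, yb = X[idx], y[idx]
    grad = xb.T @ (xb @ w - yb) / batch_size   # gradient estimated on the mini-batch only
    w -= lr * grad                             # w_{t+1} <- w_t - eta * grad
print(w)  # approaches [2.0, -1.0, 0.5]
```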
Federated averaging
(FedAvg)
Basic notation
• We consider a set of $K$ parties (clients)
• Each party $k$ holds a dataset $D_k$ of $n_k$ points
• Let $D = D_1 \cup \dots \cup D_K$ be the joint dataset and $n = \sum_k n_k$ the total number of points
• We want to solve problems of the form $\min_{\theta \in \mathbb{R}^p} F(\theta; D)$ where
$$F(\theta; D) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(\theta; D_k) \quad \text{and} \quad F_k(\theta; D_k) = \frac{1}{n_k} \sum_{d \in D_k} f(\theta; d)$$
• $\theta \in \mathbb{R}^p$ are the model parameters (e.g., weights of a logistic regression or neural network)
• This covers a broad class of ML problems formulated as empirical risk minimization
FEDAVG [McMahan et al.]
• FedAvg with L > 1 local updates per round allows reducing the number of communication rounds, which is often the bottleneck in FL (especially in the cross-device setting)
• It empirically achieves better generalization than parallel SGD with large mini-batches
• Convergence to the optimal model can be guaranteed for i.i.d. data, but issues arise in the strongly non-i.i.d. case
FedAvg is more than model averaging

[McMahan et al.]
Linear regression
$\mathbf{x} = (x_1, x_2, \dots, x_m)$ – data sample
$y = f(\boldsymbol{\theta}, \mathbf{x}) = \theta_0 + \sum_{j=1}^{m} \theta_j x_j$, where $\boldsymbol{\theta} = (\theta_0, \theta_1, \dots, \theta_m)$ are the regression coefficients
$D = \{(\mathbf{x}^{(l)}, y^{(l)})\},\; l = 1 \dots n$ – training sample (distributed among the K clients)

Minimize the loss:
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2n} \sum_{l=1}^{n} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right)^2$$

https://doi.org/10.1016/j.ins.2020.12.007
Linear regression
With gradient descent $\boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}^{i} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$, where
$\boldsymbol{\theta}^{0}$ – random initialization,
$\alpha$ – learning rate,
$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$ – gradient of $\mathcal{L}(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$.

Hence for the k-th client:
$$\theta_0^{i+1} = \theta_0^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right)$$
$$\theta_j^{i+1} = \theta_j^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right) x_j^{(l)}$$
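A minimal sketch of one such local step for the k-th client, vectorised over all coefficients; the function name and data layout are assumptions for illustration:

```python
import numpy as np

def local_linear_step(theta, X_k, y_k, alpha=0.1):
    """One local gradient step of client k for linear regression.

    theta[0] is the intercept theta_0, theta[1:] the coefficients; X_k has shape (n_k, m).
    """
    n_k = len(y_k)
    residual = theta[0] + X_k @ theta[1:] - y_k        # f(theta, x) - y for each local sample
    new_theta = theta.copy()
    new_theta[0] -= alpha / n_k * residual.sum()       # update of theta_0
    new_theta[1:] -= alpha / n_k * (X_k.T @ residual)  # update of theta_j, j = 1..m
    return new_theta
```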
Ridge regression
Linear regression with ℓ2-norm regularization.
Loss:
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2n} \sum_{l=1}^{n} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right)^2 + \lambda \|\boldsymbol{\theta}\|^2$$
Hence for the k-th client:
$$\theta_0^{i+1} = (1 - 2\lambda\alpha)\,\theta_0^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right)$$
$$\theta_j^{i+1} = (1 - 2\lambda\alpha)\,\theta_j^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right) x_j^{(l)}$$
https://doi.org/10.1016/j.ins.2020.12.007
Logistic regression
$$y = \frac{1}{1 + e^{-f(\boldsymbol{\theta}, \mathbf{x})}}$$
The loss function can be approximated by a second-order Taylor series, with ℓ2-norm regularization:
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n} \sum_{l=1}^{n} \left( \log 2 - \frac{1}{2} y^{(l)} f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) + \frac{1}{8} f(\boldsymbol{\theta}, \mathbf{x}^{(l)})^2 \right) + \lambda \|\boldsymbol{\theta}\|^2$$

https://doi.org/10.1016/j.ins.2020.12.007
Logistic regression
Hence for the k-th client:
$$\theta_0^{i+1} = (1 - 2\lambda\alpha)\,\theta_0^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( \frac{1}{4} f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - \frac{1}{2} y^{(l)} \right)$$
$$\theta_j^{i+1} = (1 - 2\lambda\alpha)\,\theta_j^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( \frac{1}{4} f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - \frac{1}{2} y^{(l)} \right) x_j^{(l)}$$
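A sketch of the corresponding local step under this approximation (assuming labels y in {-1, +1}, which the Taylor expansion presupposes; names and hyperparameters are illustrative):

```python
import numpy as np

def local_logistic_step(theta, X_k, y_k, alpha=0.1, lam=0.01):
    """One local step for the Taylor-approximated, l2-regularised logistic loss (client k)."""
    n_k = len(y_k)
    f = theta[0] + X_k @ theta[1:]             # linear score f(theta, x) per local sample
    g = 0.25 * f - 0.5 * y_k                   # the (1/4) f - (1/2) y term from the update rule
    new_theta = (1 - 2 * lam * alpha) * theta  # shrinkage applied to all coefficients, as above
    new_theta[0] -= alpha / n_k * g.sum()
    new_theta[1:] -= alpha / n_k * (X_k.T @ g)
    return new_theta
```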
Some challenges
Expensive communication
Can reduce communication in federated optimization by:
• Limiting number of devices involved in communication
• Reducing number of communication rounds
• Reducing size of messages sent over network
FedAvg
At each communication round:
• (i) run SGD locally, then
• (ii) average the model updates
Reduces communication by:
• (i) performing local updating,
• (ii) communicating with a subset of devices

Why is it useful to perform local updating?


Communication-efficient FL
Reduce the size of messages

Common approaches:
• Dimensionality reduction (low-rank, sparsity)
- Directly learn model updates that have reduced dimension/size
• Compression
- Take regular (full dimension) updates and then compress them
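As an illustration of the compression idea, a top-k sparsifier keeps only the largest-magnitude entries of a full-dimension update before it is sent; this is a generic sketch, not the method of any particular system:

```python
import numpy as np

def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries of a model update; the rest become zero."""
    sparse = np.zeros_like(update)
    idx = np.argsort(np.abs(update))[-k:]
    sparse[idx] = update[idx]
    return sparse

update = np.array([0.02, -1.3, 0.4, 0.001, -0.7])
print(top_k_sparsify(update, k=2))   # only -1.3 and -0.7 would be transmitted
```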
Heterogeneity
Heterogeneous (i.e., non-identically distributed) data and systems
can bias optimization procedures
Non-IID Data in Federated Learning
Types of non-IID data:
• Feature distribution skew (covariate shift)
• Label distribution skew (prior probability shift)
• Same label, different features (concept shift)
• Same features, different label (concept shift)
• Quantity skew or unbalancedness
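In experiments, label distribution skew (and, as a side effect, quantity skew) is often simulated by splitting a centralized dataset across clients with a Dirichlet distribution; the sketch below is one common recipe, with illustrative names and synthetic labels:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=60_000)          # synthetic labels for 10 classes

def dirichlet_label_skew(labels, n_clients=5, alpha=0.3):
    """Smaller alpha -> stronger label distribution skew across clients."""
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_indices, np.split(idx, splits)):
            client.extend(part)
    return client_indices

parts = dirichlet_label_skew(labels)
print([len(p) for p in parts])                      # unequal sizes: quantity skew as well
```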
Drift in FedAvg

(arXiv:1910.06378)
Possible solutions

(Zhu et al. Federated Learning on Non-IID Data: A Survey)


Possible solutions
• SCAFFOLD: Stochastic Controlled Averaging for Federated
Learning (arXiv:1910.06378)
- uses control variates (variance reduction) to correct for the ‘client-drift’ in
its local updates
• FedProx - a federated optimization algorithm with a proximal
term (Li, et al., Federated optimization in heterogeneous
networks, MLSys, 2020)
- adds a proximal term to the local sub-problem to effectively limit the
impact of variable local updates
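To illustrate the proximal-term idea (a sketch, not the authors' reference implementation): the client minimises its local loss plus (mu/2)·||theta - theta_global||^2, so each local step adds a pull-back term towards the global model:

```python
import numpy as np

def fedprox_local_step(theta, theta_global, grad_fn, lr=0.05, mu=0.1):
    """One local step on F_k(theta) + (mu/2) * ||theta - theta_global||^2.

    grad_fn(theta) returns the gradient of the client's local loss F_k; the extra
    mu * (theta - theta_global) term limits client drift during local updating.
    """
    grad = grad_fn(theta) + mu * (theta - theta_global)
    return theta - lr * grad
```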
Federated learning of personalized models
• Learning from non-i.i.d. data is difficult/slow because each party wants the
model to go in a particular direction
• If data distributions are very different, learning a single model which
performs well for all parties may require a very large number of parameters
• Another direction to deal with non-i.i.d. data is thus to lift the requirement
that the learned model should be the same for all parties (“one size fits all”)
• Instead, we can allow each party k to learn a (potentially simpler)
personalized model θk but design the objective so as to enforce some kind of
collaboration
Approaches for personalization
• Multi-task learning
- Jointly learn shared, yet personalized models
• Fine-tuning
- Learn a global model, then “fine-tune”/adapt it on local data (see the sketch after this list)
- See also: transfer learning, domain adaptation
• Meta learning (initialization-based)
- Learn initialization over multiple tasks, then train locally
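A minimal sketch of the fine-tuning approach on a simple least-squares model: each client copies the global parameters and adapts them with a few local gradient steps (names and model are illustrative):

```python
import numpy as np

def personalize_by_finetuning(theta_global, X_k, y_k, lr=0.05, steps=10):
    """Adapt the shared global model to client k's local data (least-squares example)."""
    theta_k = theta_global.copy()
    for _ in range(steps):
        grad = X_k.T @ (X_k @ theta_k - y_k) / len(y_k)
        theta_k -= lr * grad
    return theta_k   # personalized model for client k; theta_global itself stays unchanged
```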
Exam material
• Slides
• McMahan et al., Communication-Efficient Learning of Deep
Networks from Decentralized Data, 2016.
Next…
• Thursday:
Lecture - Vertical federated learning (Afsana)
• Next week:
No classes - carnival
