

Data Fusion – KEN4223

Lecture 1
Data Fusion course - Team
• Anna Wilbik, professor in data fusion and intelligent interaction
• Marcin Pietrasik, post-doc in data fusion
• Afsana Khan, PhD student in federated learning
Who are you?

(https://www.flickr.com/photos/saulalbert/37545736336)
Data fusion
• A process dealing with the association, correlation, and combination of data and
information from single and multiple sources to achieve refined position and identity
estimates, and complete and timely assessments of situations and threats as well as
their significance
• Data fusion is a formal framework in which are expressed means and tools for the
alliance of data originating from different sources. It aims at obtaining information of
greater quality; the exact definition of ‘greater quality’ will depend upon the
application
• Data Fusion is the analysis of several data sets such that different data sets can
interact and inform each other
• DF is a framework, fit by an ensemble of tools, for the joint analysis of data from
multiple sources (modalities) that allows achieving information/knowledge not
recoverable by the individual ones.
Taxonomy of data fusion
Federated learning as model fusion

https://doi.org/10.1016/B978-0-444-63984-4.00001-6
https://towardsdatascience.com/introduction-to-ibm-federated-learning-a-collaborative-approach-to-train-ml-models-on-private-data-2b4221c3839
Schedule (1)
Date Topic
06.02.2024 08:30-10:30 Lecture: Introduction. Federated learning (1)
08.02.2024 08:30-10:30 Lecture: Federated learning (2)
20.02.2024 08:30-10:30 Lab: Federated learning
21.02.2024 11:00-13:00 Lecture: High level fusion
22.02.2024 11:00-13:00 Guest Lecture: Industry perspective
28.02.2024 11:00-13:00 Lab: High level fusion
29.02.2024 11:00-13:00 Lecture: Mid-level fusion (1)
06.03.2024 11:00-13:00 Lecture: Mid-level fusion (2)
07.03.2024 11:00-13:00 Lab: Mid-level fusion
12.03.2024 09:00-10:30* Q&A
Schedule (2)

Date Topic
13.03.2024 11:00-13:00 Lecture: Low level fusion
14.03.2024 11:00-13:00 Lab: Low level fusion
20.03.2024 11:00-13:00 Lecture: Outcome Economy
21.03.2024 11:00-13:00 Lab: Outcome Economy
26.03.2024 09:00-10:30* Exam preparation (Q&A)
27.03.2024 11:00-13:00 Assignment presentations
28.03.2024 13:00 Assignment deadline
Materials
• Handouts from the lectures
• Scientific articles – please check Canvas
Grading
• Assignment (A) (0-10) 30% weight
• Written exam (B) (0-10) 70% weight
• Bonus (C) (0.25)
Final grade = MAX(1,MIN(10,ROUND(0.3*A+0.7*B+C)))
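A quick sanity check of this formula in code (a minimal sketch; the grades used are made up):

```python
def final_grade(a, b, c=0.0):
    """Final grade = MAX(1, MIN(10, ROUND(0.3*A + 0.7*B + C)))."""
    return max(1, min(10, round(0.3 * a + 0.7 * b + c)))

print(final_grade(7.5, 6.8, 0.25))  # hypothetical grades -> 7
```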

Assignment:
• in groups – use different data fusion techniques on a data set that is provided
• Presentation presence – compulsory (penalty for absent group member)
Resit:
• Written exam - Regular resit in the ongoing academic year
• Assignment – Special repair project for groups with initial score between 4 and 6.
Repair score is 6.
Communication
• Lectures
• Instructions
• Canvas forum
• Please don’t send us emails
about the course content – use Canvas!

(https://universalscribbler.wordpress.com/2012/07/16/work-dealing-with-email-overflow/)
Federated learning
Taxonomy of data fusion
Federated learning as model fusion

https://doi.org/10.1016/B978-0-444-63984-4.00001-6
https://towardsdatascience.com/introduction-to-ibm-federated-learning-a-collaborative-approach-to-train-ml-models-on-private-data-2b4221c3839
From centralized to decentralized data


Possible options

• Collect data centrally, anyway


• Use only local data
• A new solution?
Why can’t we just centralize the data?

• Sending the data may be too costly


- Self-driving cars are expected to generate several TBs of data a day
- Some wireless devices have limited bandwidth/power
• Data may be considered too sensitive
- Growing public awareness and regulations on data privacy
- Competitive advantage in business and research
How about each party learning on its own?

• The local dataset may be too small


- Sub-par predictive performance (e.g., due to overfitting)
- Non-statistically significant results (e.g., medical studies)

• The local dataset may be biased


- Not representative of the target distribution
Possible options

• Collect data centrally, anyway


• Use only local data
• A new solution? - Federated learning
Federated learning

“We advocate an alternative that leaves the training data


distributed on the mobile devices, and learns a shared model by
aggregating locally-computed updates. We term this decentralized
approach Federated Learning.”

McMahan et al., Communication-Efficient Learning of Deep Networks from Decentralized Data, 2016.
Federated learning

“Federated learning is a machine learning setting where multiple


entities (clients) collaborate in solving a machine learning problem,
under the coordination of a central server or service provider. Each
client’s raw data is stored locally and not exchanged or transferred;
instead focused updates intended for immediate aggregation are
used to achieve the learning objective.”

Kairouz et al., Advances and open problems in federated learning, 2019.


Federated learning

“collaborative learning without exchanging users’ original data”

Li et al., A survey on federated learning systems: vision, hype and reality for data privacy
and protection, 2019.
Key differences with distributed learning
Data distribution
• In distributed learning, data is centrally stored (e.g., in a data
center)
- The main goal is just to train faster
- We control how data is distributed across workers: usually, it is
distributed uniformly at random across workers
• In FL, data is naturally distributed and generated locally
- Data is not independent and identically distributed (non-i.i.d.), and it is
imbalanced
FL – area under development
[Chart: Web of Science record count of federated learning publications per year, 2016–2024 (y-axis 0–3,000), showing rapid growth]
Gboard: next-word prediction
• Federated RNN (compared to prior n-gram model):
• Better next-word prediction accuracy: +24%
• More useful prediction strip: +10% more clicks

Hard et al., Federated Learning for Mobile Keyboard Prediction, arXiv:1811.03604

https://medcitynews.com/2020/05/upenn-intel-partner-to-use-federated-learning-ai-for-early-brain-tumor-detection/
https://www.technologyreview.com/2019/12/11/131629/apple-ai-personalizes-siri-federated-learning/
https://blogs.nvidia.com/blog/2020/04/15/federated-learning-mammogram-assessment/
Taxonomy of Federated Learning
Federated learning systems

• Data partitioning: horizontal, vertical, hybrid
• Machine learning model: linear models, neural networks, …
• Privacy mechanism: differential privacy, cryptographic methods
• Communication architecture: centralized, decentralized
• Scale of federation: cross-silo, cross-device
• Motivation for federation: incentive, regulation
Li et al., A survey on federated learning systems: vision, hype and reality for data privacy and protection,
arXiv preprint arXiv:1907.09693, 2019.
Data partitioning
[Diagram: Horizontal FL – parties A and B hold different samples described by the same features and labels. Vertical FL – parties A and B hold different features for the same samples, with the labels on one side.]
Horizontal FL
Vertical FL
Hybrid FL
Communication architecture
• Server-orchestrated FL
• Fully decentralized FL
Scale of federation
Cross-silo federated learning:
• Training a model on siloed data; clients are different organizations (e.g. medical or financial) or geo-distributed datacenters
• All clients are almost always available
• Typically 2–100 clients
• Relatively few failures
• Partition is fixed, horizontal or vertical

Cross-device federated learning:
• Clients are a very large number of mobile or IoT devices
• Only a fraction of clients are available at any one time, often with diurnal or other variations
• Massively parallel, up to 10^10 clients
• Highly unreliable: 5% or more of the clients participating in a round of computation are expected to fail or drop out
• Fixed horizontal partition
Motivation for federation
• Incentive
- Obtaining a better model
- Compensation for sharing data
- …
• Regulation
The Lifecycle of a Model in Federated Learning
1. Problem identification
2. Client instrumentation
3. Simulation prototyping (optional)
4. Federated model training
5. (Federated) model evaluation
6. Deployment
Horizontal federated learning
[Diagram: parties A and B hold different samples described by the same features and labels]
How does it work?
1. Client selection
2. Broadcast
3. Client computation
4. Aggregation
5. Model update
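To make the five steps concrete, here is a minimal NumPy sketch of one server-orchestrated, FedAvg-style round for a linear model. The names (make_client, client_update), the synthetic data and the hyperparameters are illustrative assumptions, not part of any specific FL framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic horizontal partition: each client holds its own (X, y) with the same features
def make_client(n_samples, n_features=5):
    X = rng.normal(size=(n_samples, n_features))
    true_theta = np.arange(1, n_features + 1, dtype=float)
    y = X @ true_theta + 0.1 * rng.normal(size=n_samples)
    return X, y

clients = [make_client(n) for n in (50, 80, 120)]  # three clients, unequal data sizes

def client_update(theta, X, y, lr=0.05, local_steps=5):
    """3. Client computation: a few local gradient steps on the client's own data."""
    theta = theta.copy()
    for _ in range(local_steps):
        grad = X.T @ (X @ theta - y) / len(y)      # gradient of the mean squared error
        theta -= lr * grad
    return theta

theta = np.zeros(5)                                # global model
for rnd in range(20):
    # 1. Client selection (here: all clients; in practice a random subset)
    selected = clients
    # 2. Broadcast the current global model; 3. clients compute updates locally
    local_models = [client_update(theta, X, y) for X, y in selected]
    # 4. Aggregation: average weighted by local dataset size n_k / n
    sizes = np.array([len(y) for _, y in selected], dtype=float)
    weights = sizes / sizes.sum()
    # 5. Model update: the weighted average becomes the new global model
    theta = sum(w * m for w, m in zip(weights, local_models))

print(theta)  # should approach [1, 2, 3, 4, 5]
```

Note that only model parameters travel between server and clients in each round; the raw data never leaves the client.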
Federated learning - objective
Goal:
$$\min_{\theta} \sum_{k=1}^{K} p_k \sum_{i=1}^{n_k} \mathcal{L}\left( f\big(x_k^{(i)}, \theta\big),\, y_k^{(i)} \right)$$
where:
$p_k$ – weight of party $k$
$\mathcal{L}(\cdot)$ – loss function
Gradient descent – recap
Derivative: the slope of the tangent line.
Partial derivative: for a multivariate function, the derivative with respect to one variable, with the other variables held fixed.
Gradient vector: the vector whose coordinates are the partial derivatives of the function.
Gradient Descent Algorithm
• Idea
- Start somewhere
- Take steps in the direction opposite to the gradient vector at the current position, until convergence
• Convergence
- Change between two steps < ε
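A minimal sketch of the algorithm on a simple two-variable quadratic (the function, names and step size are illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, eps=1e-6, max_iter=10_000):
    """Start somewhere, step against the gradient until the change is smaller than eps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = lr * grad(theta)
        theta = theta - step
        if np.linalg.norm(step) < eps:   # convergence: change between two steps < eps
            break
    return theta

# Example: f(x, y) = (x - 3)^2 + 2*(y + 1)^2, so the gradient is (2(x - 3), 4(y + 1))
grad_f = lambda t: np.array([2 * (t[0] - 3), 4 * (t[1] + 1)])
print(gradient_descent(grad_f, [0.0, 0.0]))  # approaches [3, -1]
```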
Stochastic Gradient Descent (SGD)
• At each step of gradient descent, instead of computing the gradient over all training samples, randomly pick a small subset (mini-batch) of training samples $(x_k, y_k)$:
$$w_{t+1} \leftarrow w_t - \eta \, \nabla f(w_t; x_k, y_k)$$
• Compared to gradient descent, SGD takes more steps to converge, but each step is much faster.
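A small sketch of the mini-batch variant on a synthetic least-squares problem (illustrative, not tied to any example from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.05 * rng.normal(size=1000)

w, lr, batch_size = np.zeros(3), 0.1, 32
for step in range(500):
    idx = rng.choice(len(y), size=batch_size, replace=False)  # random mini-batch (x_k, y_k)
    xb, yb = X[idx], y[idx]
    grad = xb.T @ (xb @ w - yb) / batch_size   # gradient estimated on the mini-batch only
    w -= lr * grad                             # w_{t+1} <- w_t - eta * grad
print(w)  # approaches [2.0, -1.0, 0.5]
```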
Federated averaging
(FedAvg)
Basic notation
• We consider a set of $K$ parties (clients)
• Each party $k$ holds a dataset $D_k$ of $n_k$ points
• Let $D = D_1 \cup \dots \cup D_K$ be the joint dataset and $n = \sum_k n_k$ the total number of points
• We want to solve problems of the form $\min_{\theta \in \mathbb{R}^p} F(\theta; D)$ where
$$F(\theta; D) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(\theta; D_k) \quad \text{and} \quad F_k(\theta; D_k) = \frac{1}{n_k} \sum_{d \in D_k} f(\theta; d)$$
• $\theta \in \mathbb{R}^p$ are the model parameters (e.g., weights of a logistic regression or neural network)
• This covers a broad class of ML problems formulated as empirical risk minimization
FEDAVG [McMahan et al.]
• FedAvg with L > 1 local updates per round allows reducing the number of communication rounds, which is often the bottleneck in FL (especially in the cross-device setting)
• It empirically achieves better generalization than parallel SGD with large mini-batches
• Convergence to the optimal model can be guaranteed for i.i.d. data, but issues arise in the strongly non-i.i.d. case
FedAvg is more than model averaging

[McMahan et al.]
Linear regression
$\mathbf{x} = (x_1, x_2, \dots, x_m)$ – data sample
$y = f(\boldsymbol{\theta}, \mathbf{x}) = \theta_0 + \sum_{j=1}^{m} \theta_j x_j$, where $\boldsymbol{\theta} = (\theta_0, \theta_1, \dots, \theta_m)$ are the regression coefficients
$D = \{(\mathbf{x}^{(l)}, y^{(l)})\},\; l = 1 \dots n$ – training sample (distributed among the K clients)

Minimize the loss:
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2n} \sum_{l=1}^{n} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right)^2$$

https://doi.org/10.1016/j.ins.2020.12.007
Linear regression
With gradient descent $\boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}^{i} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$, where
$\boldsymbol{\theta}^{0}$ – random initialization,
$\alpha$ – learning rate,
$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$ – gradient of $\mathcal{L}(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$.

Hence for the k-th client:
$$\theta_0^{i+1} = \theta_0^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right)$$
$$\theta_j^{i+1} = \theta_j^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right) x_j^{(l)}$$
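A minimal sketch of one such local step for the k-th client, vectorised over all coefficients; the function name and data layout are assumptions for illustration:

```python
import numpy as np

def local_linear_step(theta, X_k, y_k, alpha=0.1):
    """One local gradient step of client k for linear regression.

    theta[0] is the intercept theta_0, theta[1:] the coefficients; X_k has shape (n_k, m).
    """
    n_k = len(y_k)
    residual = theta[0] + X_k @ theta[1:] - y_k        # f(theta, x) - y for each local sample
    new_theta = theta.copy()
    new_theta[0] -= alpha / n_k * residual.sum()       # update of theta_0
    new_theta[1:] -= alpha / n_k * (X_k.T @ residual)  # update of theta_j, j = 1..m
    return new_theta
```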
Ridge regression
Linear regression with ℓ2-norm regularization.
Loss:
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2n} \sum_{l=1}^{n} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right)^2 + \lambda \|\boldsymbol{\theta}\|^2$$
Hence for the k-th client:
$$\theta_0^{i+1} = (1 - 2\lambda\alpha)\,\theta_0^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right)$$
$$\theta_j^{i+1} = (1 - 2\lambda\alpha)\,\theta_j^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - y^{(l)} \right) x_j^{(l)}$$
https://doi.org/10.1016/j.ins.2020.12.007
Logistic regression
$$y = \frac{1}{1 + e^{-f(\boldsymbol{\theta}, \mathbf{x})}}$$
The loss function can be approximated by a second-order Taylor series, with ℓ2-norm regularization:
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n} \sum_{l=1}^{n} \left( \log 2 - \frac{1}{2} y^{(l)} f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) + \frac{1}{8} f(\boldsymbol{\theta}, \mathbf{x}^{(l)})^2 \right) + \lambda \|\boldsymbol{\theta}\|^2$$

https://doi.org/10.1016/j.ins.2020.12.007
Logistic regression
Hence for the k-th client:
$$\theta_0^{i+1} = (1 - 2\lambda\alpha)\,\theta_0^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( \frac{1}{4} f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - \frac{1}{2} y^{(l)} \right)$$
$$\theta_j^{i+1} = (1 - 2\lambda\alpha)\,\theta_j^{i} - \frac{\alpha}{n_k} \sum_{l=1}^{n_k} \left( \frac{1}{4} f(\boldsymbol{\theta}, \mathbf{x}^{(l)}) - \frac{1}{2} y^{(l)} \right) x_j^{(l)}$$
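A sketch of the corresponding local step under this approximation (assuming labels y in {-1, +1}, which the Taylor expansion presupposes; names and hyperparameters are illustrative):

```python
import numpy as np

def local_logistic_step(theta, X_k, y_k, alpha=0.1, lam=0.01):
    """One local step for the Taylor-approximated, l2-regularised logistic loss (client k)."""
    n_k = len(y_k)
    f = theta[0] + X_k @ theta[1:]             # linear score f(theta, x) per local sample
    g = 0.25 * f - 0.5 * y_k                   # the (1/4) f - (1/2) y term from the update rule
    new_theta = (1 - 2 * lam * alpha) * theta  # shrinkage applied to all coefficients, as above
    new_theta[0] -= alpha / n_k * g.sum()
    new_theta[1:] -= alpha / n_k * (X_k.T @ g)
    return new_theta
```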
Some challenges
Expensive communication
Can reduce communication in federated optimization by:
• Limiting number of devices involved in communication
• Reducing number of communication rounds
• Reducing size of messages sent over network
FedAvg
At each communication round:
• (i) run SGD locally, then
• (ii) average the model updates
Reduces communication by:
• (i) performing local updating,
• (ii) communicating with a subset of devices

Why is it useful to perform local updating?


Communication-efficient FL
Reduce the size of messages

Common approaches:
• Dimensionality reduction (low-rank, sparsity)
- Directly learn model updates that have reduced dimension/size
• Compression
- Take regular (full dimension) updates and then compress them
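As an illustration of the compression idea, a top-k sparsifier keeps only the largest-magnitude entries of a full-dimension update before it is sent; this is a generic sketch, not the method of any particular system:

```python
import numpy as np

def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries of a model update; the rest become zero."""
    sparse = np.zeros_like(update)
    idx = np.argsort(np.abs(update))[-k:]
    sparse[idx] = update[idx]
    return sparse

update = np.array([0.02, -1.3, 0.4, 0.001, -0.7])
print(top_k_sparsify(update, k=2))   # only -1.3 and -0.7 would be transmitted
```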
Heterogeneity
Heterogeneous (i.e., non-identically distributed) data and systems
can bias optimization procedures
Non-IID Data in Federated Learning
Types of non-IID data:
• Feature distribution skew (covariate shift)
• Label distribution skew (prior probability shift)
• Same label, different features (concept shift)
• Same features, different label (concept shift)
• Quantity skew or unbalancedness
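In experiments, label distribution skew (and, as a side effect, quantity skew) is often simulated by splitting a centralized dataset across clients with a Dirichlet distribution; the sketch below is one common recipe, with illustrative names and synthetic labels:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=60_000)          # synthetic labels for 10 classes

def dirichlet_label_skew(labels, n_clients=5, alpha=0.3):
    """Smaller alpha -> stronger label distribution skew across clients."""
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_indices, np.split(idx, splits)):
            client.extend(part)
    return client_indices

parts = dirichlet_label_skew(labels)
print([len(p) for p in parts])                      # unequal sizes: quantity skew as well
```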
Drift in FedAvg

(arXiv:1910.06378)
Possible solutions

(Zhu et al. Federated Learning on Non-IID Data: A Survey)


Possible solutions
• SCAFFOLD: Stochastic Controlled Averaging for Federated
Learning (arXiv:1910.06378)
- uses control variates (variance reduction) to correct for the ‘client-drift’ in
its local updates
• FedProx - a federated optimization algorithm with a proximal
term (Li, et al., Federated optimization in heterogeneous
networks, MLSys, 2020)
- adds a proximal term to the local sub-problem to effectively limit the
impact of variable local updates
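To illustrate the proximal-term idea (a sketch, not the authors' reference implementation): the client minimises its local loss plus (mu/2)·||theta - theta_global||^2, so each local step adds a pull-back term towards the global model:

```python
import numpy as np

def fedprox_local_step(theta, theta_global, grad_fn, lr=0.05, mu=0.1):
    """One local step on F_k(theta) + (mu/2) * ||theta - theta_global||^2.

    grad_fn(theta) returns the gradient of the client's local loss F_k; the extra
    mu * (theta - theta_global) term limits client drift during local updating.
    """
    grad = grad_fn(theta) + mu * (theta - theta_global)
    return theta - lr * grad
```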
Federated learning of personalized models
• Learning from non-i.i.d. data is difficult/slow because each party wants the
model to go in a particular direction
• If data distributions are very different, learning a single model which
performs well for all parties may require a very large number of parameters
• Another direction to deal with non-i.i.d. data is thus to lift the requirement
that the learned model should be the same for all parties (“one size fits all”)
• Instead, we can allow each party k to learn a (potentially simpler)
personalized model θk but design the objective so as to enforce some kind of
collaboration
Approaches for personalization
• Multi-task learning
- Jointly learn shared, yet personalized models
• Fine-tuning
- Learn a global model, then “fine-tune”/adapt it on local data (see the sketch after this list)
- See also: transfer learning, domain adaptation
• Meta learning (initialization-based)
- Learn initialization over multiple tasks, then train locally
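A minimal sketch of the fine-tuning approach on a simple least-squares model: each client copies the global parameters and adapts them with a few local gradient steps (names and model are illustrative):

```python
import numpy as np

def personalize_by_finetuning(theta_global, X_k, y_k, lr=0.05, steps=10):
    """Adapt the shared global model to client k's local data (least-squares example)."""
    theta_k = theta_global.copy()
    for _ in range(steps):
        grad = X_k.T @ (X_k @ theta_k - y_k) / len(y_k)
        theta_k -= lr * grad
    return theta_k   # personalized model for client k; theta_global itself stays unchanged
```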
Exam material
• Slides
• McMahan et al., Communication-Efficient Learning of Deep
Networks from Decentralized Data, 2016.
Next…
• Thursday:
Lecture - Vertical federated learning (Afsana)
• Next week:
No classes - carnival
