Differential Privacy in The Wild: A Tutorial On Current Practices and Open Challenges
Michael Hay
Assistant Professor, Colgate University
Xi He
Ph.D. Candidate, Duke University
• Genome-wide association studies
• Human mobility analysis
Tutorial: Differential Privacy in the Wild 4
Personal data is … well … personal!
Sensitive attributes: Age, Income, Address, Likes/Dislikes, Sexual Orientation, Medical History
Potential harms: Redlining, Discrimination, Physical/Financial Harm
Source (time.com)
Source (http://xkcd.org/834/)
• What is privacy?
Module 2: Differential Privacy
• Differential Privacy Definition
• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response
• Composition Theorems
• Open Questions
Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private
– OnTheMap, RAPPOR
• Blowfish Privacy
• What is privacy?
Statistical Databases
Individuals with sensitive data (Person 1, …, Person N, with records r1, …, rN) contribute to a statistical database; analysts (e.g., economists) work with the released statistics. Example: a medical data release.
Linkage Attack
• Medical data release: SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
• Voter list: Name, Address, Party affiliation, Date last voted
• Shared attributes: Zip, Birth date, Sex
The Governor of MA was uniquely identified: he was registered in the voter list with his ZipCode, Birth Date, and Sex, so his name was linked to his diagnosis in the medical data.
Linkage Attack
The shared attributes (ZipCode, Birth Date, Sex) form a quasi-identifier: 87% of the US population is uniquely identified by this combination.
Server
A server holds a database DB and releases query answers f(DB). The output can disclose sensitive information about individuals, and individuals do not want the released f(DB) to let the server (or anyone else) infer their records.
• Statistical Database:
- A single database and a single agent
- Want to release aggregate statistics about a set of
records without allowing access to individual records
[MKGV 06] Example: a generalized (k-anonymized) table with groups (Zip = 130**, Age < 30) and (Zip = 1485*, Age > 40) and diseases Heart disease / Cancer. Background knowledge about which group Alice falls in, combined with homogeneity of disease values within a group, can reveal that "Alice has Cancer."
A privacy mechanism must be able to
protect individuals’ privacy from
attackers who may possess
background knowledge
#Hospital discharges in NJ of ovarian cancer patients, 2009 (* = suppressed cell):

Age 1-17:    *   *   *   *  *  *   *  *
Age 18-44:  70  40  13   *  *  *   *  *
Age 45-64: 330 236  31  32  *  *  11  *
Age 85+:    34  29   *   *  *  *   *  *

Cell suppression fails under a differencing attack: using the published column total and the unsuppressed cells in the other rows, a suppressed 1-17 cell is recovered as * = 535 - (40 + 236 + 229 + 29). Publishing a range instead (e.g., [1-3] for a suppressed 85+ cell) still leaks information.
[VSJO 13]
[DN03, GKS08]
[KL10, MK15]
• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response
• Composition Theorems
Differential Privacy
For every pair of inputs D1, D2 that differ in one row, and for every output O:
If algorithm A satisfies differential privacy, then
Pr[A(D1) = O] / Pr[A(D2) = O] < exp(ε)   (ε > 0)
Intuition: the adversary should not be able to use output O to distinguish between any D1 and D2.
Why pairs of datasets that differ in one row?
The guarantee quantifies over every pair of inputs D1, D2 differing in one row and over every output: for each Ok in the set of all outputs, the probabilities P[A(D1) = Ok] and P[A(D2) = Ok] must be close.
One should not be able to distinguish whether the input was D1 or D2, no matter what the output: ε bounds the worst discrepancy in probabilities over all outputs.
Privacy Parameter ε
For every pair of inputs D1, D2 that differ in one row, and every output O:
Pr[A(D1) = O] ≤ e^ε Pr[A(D2) = O]
• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response
• Composition Theorems
A deterministic algorithm cannot be differentially private: if some output O has Pr[D1 → O] = 1 but Pr[D2 → O] = 0, then
Pr[D1 → O] / Pr[D2 → O] = ∞,
which exceeds e^ε for every finite ε. Every output must have nonzero probability under every input.
Laplace Mechanism
A researcher poses a query q; the server returns q(D) + η, where the noise η is drawn from the Laplace distribution with scale λ (mean 0, variance 2λ²). Privacy depends on the λ parameter.
[Figure: Laplace densities over the range -10 to 10]

How much noise for privacy? [Dwork et al., TCC 2006]
• Solution: release the true answer plus noise (e.g., a true count of 3 is released as 3 + η), with η ~ Lap(S(q)/ε)
• Sensitivity S(q): the maximum change in q between databases that differ in one row
• Expected squared error = Var( Lap(S(q)/ε) ) = 2 S(q)² / ε²
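As a concrete sketch (illustrative code, not from the tutorial), the Laplace mechanism adds noise with scale λ = S(q)/ε to the true answer; for a counting query the sensitivity is 1:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Release true_answer + Lap(sensitivity/epsilon) noise (epsilon-DP)."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # lambda = S(q)/epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

# A counting query ("how many rows have Disease = Y?") has sensitivity 1:
# adding or removing one row changes the count by at most 1.
true_count = 3
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.1)
```

The released value is unbiased (the noise has mean 0), and its variance 2 S(q)²/ε² matches the formula above.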
Randomized Response
Each respondent reports their true value in {Y, N} truthfully with probability p and flips it otherwise (so a true Y may be reported as Y or N, and likewise for N).
• The estimate π̂ of the true proportion π is unbiased: E(π̂) = π
• Var(π̂) = π(1-π)/n + p(1-p) / (n(2p-1)²)
Privacy
• Provides the same ε-differential privacy guarantee, with ε = ln( p / (1-p) )
Utility
• Suppose a database with N records where µN records have Disease = Y.
• Query: # rows with Disease = Y
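A minimal sketch of randomized response and its unbiased estimator (illustrative code; the variable names are ours):

```python
import random

def randomized_response(true_bit, p, rng=random):
    """Report the true bit with probability p, flip it otherwise.
    Satisfies epsilon-DP with epsilon = ln(p / (1 - p)) for p > 1/2."""
    return true_bit if rng.random() < p else 1 - true_bit

def estimate_proportion(reports, p):
    """Unbiased estimate of the true fraction of 1s from noisy reports."""
    lam = sum(reports) / len(reports)     # observed fraction of 1s
    return (lam - (1 - p)) / (2 * p - 1)  # invert the expected noise

rng = random.Random(0)
true_bits = [1] * 300 + [0] * 700  # true proportion pi = 0.3
reports = [randomized_response(b, p=0.75, rng=rng) for b in true_bits]
pi_hat = estimate_proportion(reports, p=0.75)  # close to 0.3
```

With p = 0.75 this gives ε = ln 3, and the estimator's variance matches the formula above.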
• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response
• Composition Theorems
• Building blocks
– Laplace, exponential mechanism and randomized
response
• Composition rules help build complex
algorithms using building blocks
• Open Questions
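Sequential composition in practice: running an ε1-DP and an ε2-DP computation on the same data yields (ε1+ε2)-DP, which suggests tracking a total budget. A minimal, hypothetical accountant sketch (not from the tutorial):

```python
import numpy as np

class BudgetAccountant:
    """Track privacy budget under sequential composition (illustrative)."""
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

def noisy_count(count, epsilon, accountant, rng):
    accountant.spend(epsilon)  # charge this query to the budget
    return count + rng.laplace(scale=1.0 / epsilon)  # sensitivity-1 count

rng = np.random.default_rng(0)
acct = BudgetAccountant(total_epsilon=1.0)
a = noisy_count(42, 0.5, acct, rng)  # two queries at eps = 0.5 each:
b = noisy_count(17, 0.5, acct, rng)  # the whole sequence is 1.0-DP
```

A third 0.5-budget query would raise an error, reflecting that the composed guarantee would otherwise exceed the total ε.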
Problem Formulation
• Input:
– Private database D consisting of a single table
(each tuple represents data of single individual)
– Workload W of counting queries* with arbitrary predicates
Source: http://www.ccs.neu.edu/home/amislove/twittermood/
Statistical agencies: data publishing
• U.S. Census Bureau publishes
statistics that can typically be
derived from marginals
• A marginal over attributes
𝐴1, … , 𝐴𝑘 reports count for
each combination of attribute
values.
– aka cube, contingency table
– E.g. 2-way marginal on
EmploymentStatus and Gender
• Thousands of marginals
released
https://factfinder.census.gov/
[WWLTRD09]
• Perform association test using these frequencies (e.g., Chi Square Test)
to identify SNPs highly associated with disease
Shown is a workload of 3 range queries.
BeijingTaxi dataset [1]: 4,268,780 records of (lat, lon) pairs of taxi pickup locations in Beijing, China, in 1 month.
[Figure: error of each method on this workload]
[CPSSY12]
Error comparison: hierarchy lowers error on large ranges but incurs slightly higher error for small ranges.
[Figure: error on small vs. large ranges]
Credit: Cormode
Data transformations
• Can think of trees as a transform of the input
• PrivBayes: decompose a high-dimensional table T (attributes A, B, C, D, E, F, G, …) into low-dimensional tables such as ABC, CD, BE, DEF
Dual Query
• Problem of generating synthetic data is formulated as a zero-sum game between:
– Data player: generates synthetic records to reduce query error
– Query player: chooses queries with high error (on the current synthetic dataset)
• Theoretical analysis of utility comes from studying the equilibrium of the game
Empirical benchmarks
• [HMMCZ16] propose a novel evaluation framework for
standardized evaluation of privacy algorithms.
• Study of algorithms for range query answering over 1 and 2D
• Benchmark website www.dpcomp.org
Some data-dependent algorithms fail to offer benefits at larger scales (number of tuples).
[Figure: labelled examples (+/-), drawn from a distribution P over labelled examples, are fed to a classification algorithm; the training data is private, but the released classifier is public]
f(D) = argmin_ω  (λ/2) ‖ω‖² + (1/n) Σ_{i=1}^{n} l(ω, x_i, y_i)

Regularizer (Model Complexity) + Risk (Training Error)
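One standard route to private ERM is output perturbation: train the non-private minimizer, then add noise scaled to its sensitivity. The sketch below is illustrative (our own parameter choices and helper names), using the fact that for a 1-Lipschitz loss the minimizer's L2 sensitivity is at most 2/(nλ):

```python
import numpy as np

def train_logreg(X, y, lam, steps=500, lr=0.1):
    """Plain L2-regularized logistic regression via gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # per-sample gradient of log(1 + exp(-y w.x)), averaged
        grad = -(X * (y * (1 - 1 / (1 + np.exp(-margins))))[:, None]).mean(0)
        w -= lr * (grad + lam * w)
    return w

def output_perturbation(X, y, lam, epsilon, rng):
    """Add noise to the non-private minimizer (illustrative sketch).
    For a 1-Lipschitz loss the minimizer's L2 sensitivity is 2/(n*lam);
    the noise vector has a uniform direction and Gamma-distributed norm."""
    n, d = X.shape
    w = train_logreg(X, y, lam)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    norm = rng.gamma(shape=d, scale=2 / (n * lam * epsilon))
    return w + norm * direction
```

Larger n or λ shrinks the noise, matching the intuition that stable learners are easier to privatize.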
[Figure: pick ω from a distribution concentrated near the optimal solution]
Loss Perturbation
– OnTheMap, RAPPOR
• Employer
– Geography (Census blocks)
– Industry
– Ownership (Public vs Private)
Why release such data?
Experimental Setup:
• OTM: Currently published
OnTheMap data used as original data.
• All destinations in Minnesota.
• 120,690 origins per destination, chosen by pruning out blocks that are > 100 miles from the destination.
– OnTheMap, RAPPOR
[Figure: example client values: Finance.com, Fashion.com, WeirdStuff.com]
On a binary domain:
With probability p report true value
With probability 1-p report false value
[Figure 1, "Life of a RAPPOR report": the client value "The number 68" is hashed onto the Bloom filter B using h (here 4) hash functions; a Permanent randomized response produces the fake Bloom filter B' (here 69 bits on), which is memoized by the client and reused; an Instantaneous randomized response over B' generates each report sent to the server (here 145 bits on).]
RAPPOR Solution
• Idea 2: Use RR on bloom filter bits
RAPPOR Solution
• Idea 3: Again use RR on the Fake bloom filter
Why randomize two times?
Server Report Decoding
• Step 5: estimate the per-bit frequencies f̂(D) from the reports
• Step 6: estimate frequencies of candidate strings with regression from f̂(D)
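Step 5 can be sketched in code. This is a simplification (our own names, one level of randomized response per bit), whereas the real RAPPOR pipeline layers permanent and instantaneous randomization and uses regression for Step 6:

```python
import numpy as np

def estimate_bit_frequencies(reports, p):
    """Unbiased per-bit frequency estimate from randomized reports.
    Simplified model: each bit is reported truthfully w.p. p and
    flipped w.p. 1-p (real RAPPOR uses a two-level scheme)."""
    observed = np.mean(reports, axis=0)   # fraction of 1s per bit position
    return (observed - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
true_bits = np.tile([1, 0, 1, 0], (5000, 1))  # every client sets bits 0 and 2
flips = rng.random(true_bits.shape) > 0.75    # flip with prob 1-p = 0.25
reports = np.where(flips, 1 - true_bits, true_bits)
f_hat = estimate_bit_frequencies(reports, p=0.75)  # close to [1, 0, 1, 0]
```

With many clients the per-bit noise averages out, which is why population-level strings are recoverable even though each individual report is noisy.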
[Figure: the reports (rows of bits) are summed into per-bit counts (e.g., 23 12 12 12 12 2 3 2 1 10), giving f̂(D), from which the frequencies of Finance.com, Fashion.com, and WeirdStuff.com are estimated]
Evaluation
http://google.github.io/rappor/examples/report.html
– OnTheMap, RAPPOR
• Employer
– Geography (Census blocks)
– Industry
– Ownership (Public vs Private)
US Census Bureau Data
Neighboring databases … differ in one record.
Existing LODES Data as Presented in the Tool
Correlations and DP
Bob and Alice (correlated records). Output d1 + δ and d2 + δ, with δ ~ Laplace(1/ε).
[Figure: the space of all possible tables consistent with each output]
Pufferfish
• Discriminative Pairs:
Mutually exclusive pairs of secrets.
– (“Bob is in the table”, “Bob is not in the table”)
– (“Bob has cancer”, “Bob has diabetes”)
– I.I.D.: Set of all f such that: P(data = {r1, r2, …, rk}) = f(r1) × f(r2) × … × f(rk)
• Discriminative Pairs:
– s_i^x : record i takes the value x
– s_i^⊥ : record i is not in the database
– S_pairs = { (s_i^x, s_i^⊥) : ∀ i ∈ [k], ∀ x in the domain }
• Data evolution:
– For all θ = [f1, f2, f3, …, fk]:
P(data = D | θ) = ∏_{r_i ∈ D} f_i(r_i)
A mechanism M satisfies differential privacy
if and only if
it satisfies Pufferfish instantiated using Spairs and {θ}
• Blowfish Privacy
• Pufferfish meaning:
– Data: matrix of bits
– Secrets: whether or not an edge (𝑢, 𝑣) is in the graph -- bit at
(𝑢, 𝑣) is 0 or 1
– Data generating distributions: all graphs where each edge e is independently present with probability p_e.
• But …
– Edges are not independent in real graphs
• Pufferfish meaning
– Data: a matrix of locations
– Secrets: Whether or not individual was at some location at some
point of time
– Data Generating Distributions: All trajectories where an
individual’s location at some time is independent of all other
locations …
• But …
– Current location depends on previous locations…
f_private(G, ε)

Degree:    0 1 2 3 4 5
Frequency: 0 5 0 0 0 1

D(G) = [0, 5, 0, 0, 0, 1]
Answer: 4

Removing edge (i, j) changes the degree frequencies:
Degree:    …  d_i - 1  d_i  …  d_j - 1  d_j  …
Frequency: …    +1     -1   …    +1     -1   …
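The bookkeeping above can be checked in code: removing one edge moves one count from d_i to d_i - 1 and one from d_j to d_j - 1, so at most four histogram entries change. A small sketch (illustrative helper names) on the slide's 6-node star graph:

```python
from collections import Counter

def degree_histogram(edges, n_nodes):
    """D(G): frequency of each degree 0..max over n_nodes nodes."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    hist = Counter(deg[v] for v in range(n_nodes))
    return [hist[d] for d in range(max(hist) + 1)]

# Star graph on 6 nodes: five degree-1 leaves, one degree-5 hub
edges = [(0, i) for i in range(1, 6)]
print(degree_histogram(edges, 6))       # [0, 5, 0, 0, 0, 1], as on the slide

# Removing edge (0, 5) shifts two nodes' degrees down by one
print(degree_histogram(edges[:-1], 6))  # [1, 4, 0, 0, 1]
```

This bounded change is exactly what a Laplace-style mechanism over the histogram exploits.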
• Blowfish Privacy
‘show me how you move and I will tell you who you are’
[GKC10]
Privacy Budget: ε at each time step
• w-event DP
– Neighboring stream prefixes x_1, x_2, …, x_t and x'_1, x'_2, …, x'_t:
• For any i < j, if x_i ≠ x'_i and x_j ≠ x'_j, then j - i + 1 ≤ w
• Each x_i and x'_i are the same or neighboring
→ Protect updates happening within any w-event window with privacy budget ε

x_1  x_2 … x_i … x_j … x_t
x'_1 x'_2 … x'_i … x'_j … x'_t
(all differing positions fit within ≤ w time steps)
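The neighboring condition can be stated directly as code. In this sketch "differ" stands in for "x_i and x'_i are neighboring databases" (a simplification; names are ours):

```python
def are_w_neighbors(x, x_prime, w):
    """Check the w-event neighboring condition on two stream prefixes:
    every index where the streams differ must fit in a window of size w."""
    assert len(x) == len(x_prime)
    diff = [i for i in range(len(x)) if x[i] != x_prime[i]]
    if not diff:
        return True
    return diff[-1] - diff[0] + 1 <= w

x  = [1, 2, 3, 4, 5, 6]
xp = [1, 9, 3, 8, 5, 6]             # differs at indices 1 and 3
print(are_w_neighbors(x, xp, w=3))  # True: window size 3 - 1 + 1 = 3
print(are_w_neighbors(x, xp, w=2))  # False
```

Streams whose differences span more than w steps are not required to be indistinguishable, which is what makes the guarantee weaker than user-level DP.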
Different Levels of Protection
[KPXP14]
• w-event DP, e.g., w = 3: every window of 3 consecutive events is protected with budget ε
• Allows a budget allocation strategy: adaptively assign privacy budgets to events within the same w-window
[Figure: released locations z1, z2, …, zt are noisy versions of true locations x1, x2, …, xt over time steps 1, …, t]
• E.g., w = 3: the budgets assigned to events within any window of 3 must sum to at most ε
• Edge DP
• Node DP
• ε-indistinguishability
• Event DP
• w-event DP
• Trajectory-level DP
• Blowfish Privacy
Is equivalent to
• Blowfish Privacy
Michael Hay
Assistant Professor, Colgate University
Xi He
Ph.D. Candidate, Duke University