
Differential Privacy in the Wild

A Tutorial on Current Practices and Open Challenges


About the Presenters
Ashwin Machanavajjhala
Assistant Professor, Duke University

“What does privacy mean … mathematically?”

Michael Hay
Assistant Professor, Colgate University

“Can algorithms be provably private and useful?”

Xi He
Ph.D. Candidate, Duke University

“Can privacy algorithms work in real world systems?”


Tutorial: Differential Privacy in the Wild 2
Our world is increasingly data driven

Source (http://www.agencypja.com/site/assets/files/1826/marketingdata-1.jpg)

Tutorial: Differential Privacy in the Wild 3


Aggregated Personal Data is invaluable
Advertising

Source (esri.com)

Genome Wide
Association Studies

Human Mobility
analysis
Tutorial: Differential Privacy in the Wild 4
Personal data is … well … personal!
Personal attributes: Age, Income, Address, Likes/Dislikes, Sexual Orientation, Medical History

Potential harms: Redlining, Discrimination, Physical/Financial Harm

Source (time.com)

Tutorial: Differential Privacy in the Wild 5


Aggregated Personal Data …
… is made publicly available in many forms.

De-identified records (e.g., medical)    Statistics (e.g., demographic)    Predictive models (e.g., advertising)

Tutorial: Differential Privacy in the Wild 6


That’s fine … I am anonymous!

Source (http://xkcd.org/834/)

Tutorial: Differential Privacy in the Wild 7


Anonymity is not enough …

Tutorial: Differential Privacy in the Wild 8


… and predictive models can breach
privacy too

Tutorial: Differential Privacy in the Wild 9


Need data analysis algorithms that can
mine aggregated personal data with
provable guarantees of privacy for
individuals.

This is the goal of Differential Privacy.

Tutorial: Differential Privacy in the Wild 10


Outline of the Tutorial
1. What is Privacy?
2. Differential Privacy
3. Answering Queries on Tabular Data
Break
4. Applications I: Machine Learning
5. Privacy in the Real World
6. Applications II: Networks and Trajectories

Tutorial: Differential Privacy in the Wild 11


Module 1: What is Privacy?
• Privacy Problem Statement

• What privacy is not …


– Encryption
– Anonymization
– Restricted Query Answering

• What is privacy?
Tutorial: Differential Privacy in the Wild 12
Module 2: Differential Privacy
• Differential Privacy Definition

• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response

• Composition Theorems

Tutorial: Differential Privacy in the Wild 13


Module 3: Answering queries on
Tabular data
• Answering query workloads on tabular databases

• Theory: two seminal results

• Survey of algorithm design ideas


– Low dimensional range queries
– Queries on high dimensional data

• Open Questions
Tutorial: Differential Privacy in the Wild 14
Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Tutorial: Differential Privacy in the Wild 15
Module 5: Privacy in the Real World
• Real world deployments of differential privacy

– OnTheMap RAPPOR

• Privacy beyond Tabular Data


– No Free Lunch Theorem
– Customizing differential privacy using Pufferfish

Tutorial: Differential Privacy in the Wild 16


Module 6: Applications II
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectories

• Blowfish Privacy

Tutorial: Differential Privacy in the Wild 17


Scope of the Tutorial
What we do not cover:
• Securing data using encryption
• Computation on encrypted data
• Computationally bounded DP
• De-anonymization
• Anonymization schemes (k-anonymity, l-diversity, etc.)
• Access control

Tutorial: Differential Privacy in the Wild 18


MODULE 1:
PROBLEM FORMULATION

Module 1 Tutorial: Differential Privacy in the Wild 19


Module 1: What is Privacy?
• Privacy Problem Statement

• What privacy is not …


– Encryption
– Anonymization
– Restricted Query Answering

• What is privacy?
Tutorial: Differential Privacy in the Wild 20
Statistical Databases
Individuals with sensitive data: Person 1 (r1), Person 2 (r2), Person 3 (r3), …, Person N (rN)

Data Collectors: Hospital, Census, Google (each maintains a DB)

Data Analysts: Economists, Doctors, Medical Researchers, Recommendation Algorithms, Information Retrieval Researchers
Module 1 Tutorial: Differential Privacy in the Wild 21
[S 02]
The Massachusetts Governor
Privacy Breach
Medical Data Release
• Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
• Zip, Birth date, Sex

Module 1 Tutorial: Differential Privacy in the Wild 22


[S 02]
The Massachusetts Governor
Privacy Breach
Medical Data Release: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Shared attributes: Zip, Birth date, Sex

Module 1 Tutorial: Differential Privacy in the Wild 23


[S 02]

Linkage Attack
Medical Data Release: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Shared attributes: Zip, Birth date, Sex

• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex.
• Name linked to Diagnosis.

Module 1 Tutorial: Differential Privacy in the Wild 24


[S 02]

Linkage Attack
Medical Data Release: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Quasi-Identifier: Zip, Birth date, Sex

• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex.
• 87% of the US population uniquely identified using ZipCode, Birth Date, and Sex.

Module 1 Tutorial: Differential Privacy in the Wild 25


Statistical Database Privacy
Function provided
by the analyst

Server
Output can disclose
sensitive information DB
about individuals

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 26


Statistical Database Privacy
Privacy for individuals
(controlled by a parameter ε)

Server
[privacy-preserving answer to the analyst’s function, computed over DB]
DB

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 27


Statistical Database Privacy
Utility for analyst

Server
[privacy-preserving answer to the analyst’s function, computed over DB]
DB

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 28


Statistical Database Privacy
(untrusted collector)
Server wants to
compute f Server

f ( DB )
Individuals do not want
server to infer their
records

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 29


Statistical Database Privacy
(untrusted collector)
Perturb records to
ensure privacy for Server
individuals and
Utility for server f ( DB* )

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 30


Statistical Databases in real-world applications

Application             | Data Collector | Private Information    | Analyst                  | Function (utility)
Medical                 | Hospital       | Disease                | Epidemiologist           | Correlation between disease and geography
Genome analysis         | Hospital       | Genome                 | Statistician/Researcher  | Correlation between genome and disease
Advertising             | Google/FB/Y!   | Clicks/Browsing        | Advertiser               | Number of clicks on an ad by age/region/gender …
Social Recommendations  | Facebook       | Friend links / profile | Another user             | Recommend other users or ads to users based on social network

Module 1 Tutorial: Differential Privacy in the Wild 31


Statistical Databases in real-world applications

• Settings where data collector may not be


trusted (or may not want the liability …)
Application        | Data Collector             | Private Information | Function (utility)
Location Services  | Verizon/AT&T               | Location            | Traffic prediction
Recommendations    | Amazon/Google              | Purchase history    | Recommendation model
Traffic Shaping    | Internet Service Provider  | Browsing history    | Traffic pattern of groups of users

Module 1 Tutorial: Differential Privacy in the Wild 32


Privacy is not …

Module 1 Tutorial: Differential Privacy in the Wild 33


Statistical Database Privacy is not …
• Encryption:

Module 1 Tutorial: Differential Privacy in the Wild 34


Statistical Database Privacy is not …
• Encryption:
Alice sends a message to Bob such that Trudy
(attacker) does not learn the message. Bob should
get the correct message …

• Statistical Database Privacy:


Bob (attacker) can access a database
- Bob must learn aggregate statistics, but
- Bob must not learn new information about
individuals in database.
Module 1 Tutorial: Differential Privacy in the Wild 35
Statistical Database Privacy is not …
• Computation on Encrypted Data:

Module 1 Tutorial: Differential Privacy in the Wild 36


Statistical Database Privacy is not …
• Computation on Encrypted Data:
- Alice stores encrypted data on a server
controlled by Bob (attacker).
- Server returns correct query answers to Alice,
without Bob learning anything about the data.

• Statistical Database Privacy:


- Bob is allowed to learn aggregate properties of
the database.

Module 1 Tutorial: Differential Privacy in the Wild 37


Statistical Database Privacy is not …
• The Millionaires’ Problem:

Module 1 Tutorial: Differential Privacy in the Wild 38


Statistical Database Privacy is not …
• Secure Multiparty Computation:
- A set of agents each having a private input xi …
- … Want to compute a function f(x1, x2, …, xk)
- Each agent can learn the true answer, but must
learn no other information than what can be
inferred from their private input and the answer.

• Statistical Database Privacy:


- Function output must not disclose individual inputs.
Module 1 Tutorial: Differential Privacy in the Wild 39
Statistical Database Privacy is not …
• Access Control:

Module 1 Tutorial: Differential Privacy in the Wild 40


Statistical Database Privacy is not …
• Access Control:
- A set of agents want to access a set of resources (could
be files or records in a database)
- Access control rules specify who is allowed to access (or
not access) certain resources.
- ‘Not access’ usually means no information must be
disclosed

• Statistical Database:
- A single database and a single agent
- Want to release aggregate statistics about a set of
records without allowing access to individual records

Module 1 Tutorial: Differential Privacy in the Wild 41


Privacy Problems
• In today’s cloud context, a number of privacy problems arise:
– Encryption: when communicating data across an insecure channel
– Secure Multiparty Computation: when different parties want to compute a function on their private data without using a centralized third party
– Computing on encrypted data: when one wants to use an untrusted cloud for computation
– Access control: when different users own different parts of the data

• Statistical Database Privacy:


Quantifying (and bounding) the amount of information
disclosed about individual records by the output of a valid
computation.

Module 1 Tutorial: Differential Privacy in the Wild 42


What is privacy?

Module 1 Tutorial: Differential Privacy in the Wild 43


Privacy Breach: Informal Definition
A privacy mechanism M(D)
that allows
an unauthorized party
to learn sensitive information about any individual in D,

which could not have been learnt without access to M(D).

Module 1 Tutorial: Differential Privacy in the Wild 44


Alice

Alice has
Cancer

Is this a privacy breach? NO

Module 1 Tutorial: Differential Privacy in the Wild 45


Privacy Breach: Revised Definition
A privacy mechanism M(D) that allows
an unauthorized party
to learn sensitive information about
any individual Alice in D,

which could not have been learnt without access to M(D)


if Alice was not in the dataset.

Module 1 Tutorial: Differential Privacy in the Wild 46


[S 02]

K-Anonymity: Avoiding Linkage Attacks


• If every row corresponds to one individual …

… every row should be indistinguishable from at least k-1 other rows


based on the quasi-identifier attributes

Module 1 Tutorial: Differential Privacy in the Wild 47


K-Anonymity
Original table (Zip, Age, Nationality, Disease)          4-anonymized table (Zip, Age, Nationality, Disease)

13053 28 Russian Heart 130** <30 * Heart

13068 29 American Heart 130** <30 * Heart

13068 21 Japanese Flu 130** <30 * Flu

13053 23 American Flu 130** <30 * Flu

14853 50 Indian Cancer 1485* >40 * Cancer

14853 55 Russian Heart 1485* >40 * Heart

14850 47 American Flu 1485* >40 * Flu

14850 59 American Flu 1485* >40 * Flu

13053 31 American Cancer 130** 30-40 * Cancer

13053 37 Indian Cancer 130** 30-40 * Cancer

13068 36 Japanese Cancer 130** 30-40 * Cancer

13068 32 American Cancer 130** 30-40 * Cancer

48
[MKGV 06]

Problem: Background knowledge


Adversary knows prior knowledge about Umeko:
Name: Umeko | Zip: 13053 | Age: 25 | Nat.: Japan

4-anonymous release (Zip, Age, Nationality, Disease):
130** | <30   | * | Heart
130** | <30   | * | Heart
130** | <30   | * | Cancer
130** | <30   | * | Cancer
1485* | >40   | * | Cancer
1485* | >40   | * | Heart
1485* | >40   | * | Flu
1485* | >40   | * | Flu
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer

Adversary learns: Umeko has Cancer

49
A privacy mechanism must be able to
protect individuals’ privacy from
attackers who may possess
background knowledge

Module 1 Tutorial: Differential Privacy in the Wild 50


Healthcare Cost and Utilization Project
#Hospital discharges in NJ of ovarian cancer patients, 2009
(Counts less than k are suppressed, achieving k-anonymity)

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | *               | 19    | 22
1-17  | *           | *     | *     | *        | *                      | *               | *     | *
18-44 | 70          | 40    | 13    | *        | *                      | *               | *     | *
45-64 | 330         | 236   | 31    | 32       | *                      | *               | 11    | *
65-84 | 298         | 229   | 35    | 13       | *                      | *               | *     | *
85+   | 34          | 29    | *     | *        | *                      | *               | *     | *
#Hospital discharges in NJ of ovarian cancer patients, 2009

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | 1               | 19    | 22
1-17  | 3           | 1     | *     | *        | *                      | *               | *     | *
18-44 | 70          | 40    | 13    | *        | *                      | *               | *     | *
45-64 | 330         | 236   | 31    | 32       | *                      | *               | 11    | *
65-84 | 298         | 229   | 35    | 13       | *                      | *               | *     | *
85+   | 34          | 29    | *     | *        | *                      | *               | *     | *

(The suppressed White count for ages 1-17 can be recovered: 1 = 535 − (40+236+229+29).)
#Hospital discharges in NJ of ovarian cancer patients, 2009

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | 1               | 19    | 22
1-17  | 3           | 1     | [0-2] | [0-2]    | [0-2]                  | [0-2]           | [0-2] | [0-2]
18-44 | 70          | 40    | 13    | *        | *                      | *               | *     | *
45-64 | 330         | 236   | 31    | 32       | *                      | *               | 11    | *
65-84 | 298         | 229   | 35    | 13       | *                      | *               | *     | *
85+   | 34          | 29    | *     | *        | *                      | *               | *     | *
#Hospital discharges in NJ of ovarian cancer patients, 2009

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | 1               | 19    | 22
1-17  | 3           | 1     | [0-2] | [0-2]    | [0-2]                  | [0-2]           | [0-2] | [0-2]
18-44 | 70          | 40    | 13    | *        | *                      | *               | *     | *
45-64 | 330         | 236   | 31    | 32       | *                      | *               | 11    | *
65-84 | 298         | 229   | 35    | 13       | *                      | *               | *     | *
85+   | 34          | 29    | [1-3] | *        | *                      | *               | *     | *
[VSJO 13]

Can reconstruct tight bounds on rest of data

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | 1               | 19    | 22
1-17  | 3           | 1     | [0-2] | [0-2]    | [0-1]                  | [0]             | [0-1] | [0-1]
18-44 | 70          | 40    | 13    | [9-10]   | [0-6]                  | [0]             | [0-6] | [1-8]
45-64 | 330         | 236   | 31    | 32       | [10]                   | [0]             | 11    | [10]
65-84 | 298         | 229   | 35    | 13       | [2-8]                  | [1]             | [2-8] | [4-10]
85+   | 34          | 29    | [1-3] | [1-4]    | [0-1]                  | [0]             | [0-1] | [0-1]


Multiple Release problem
• Privacy preserving access to data must
necessarily release some information about
individual records (to ensure utility)

• However, k-anonymous algorithms can reveal


individual level information even with two
releases.

Module 1 Tutorial: Differential Privacy in the Wild 57


A privacy mechanism must satisfy
composition …
… or allow a graceful degradation of privacy with
multiple invocations on the same data.

[DN03, GKS08]

Module 1 Tutorial: Differential Privacy in the Wild 58


Postprocessing the output of a privacy
mechanism must not change the privacy
guarantee

[KL10, MK15]

Module 1 Tutorial: Differential Privacy in the Wild 59


Privacy must not be achieved through
obscurity.
Attacker must be assumed to know the algorithm
used as well as all parameters

Module 1 Tutorial: Differential Privacy in the Wild 60


Summary
• Statistical database privacy is the problem of
releasing aggregates while not disclosing individual
records

• The problem is distinct from encryption, secure


computation and access control.

• Defining privacy is non-trivial


– Desiderata include resilience to background knowledge
and composition and closure under postprocessing.

Module 1 Tutorial: Differential Privacy in the Wild 61


MODULE 2:
DIFFERENTIAL PRIVACY

Module 2 Tutorial: Differential Privacy in the Wild 62


Module 2: Differential Privacy
• Differential Privacy Definition

• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response

• Composition Theorems

Tutorial: Differential Privacy in the Wild 63


[D 06]

Differential Privacy
For every pair of inputs D1, D2 that differ in one row, and for every output O:

D1    D2                O

If algorithm A satisfies differential privacy then
Pr[A(D1) = O] / Pr[A(D2) = O]  ≤  exp(ε)       (ε > 0)

Intuition: the adversary should not be able to use the output O to distinguish between any D1 and D2

Module 2 Tutorial: Differential Privacy in the Wild 64
Why pairs of datasets that differ in one row?
For every pair of inputs that
For every output …
differ in one row

D1 D2 O

Simulate the presence or absence of a


single record

Module 2 Tutorial: Differential Privacy in the Wild 65


Why all pairs of datasets …?
For every pair of inputs that
For every output …
differ in one row

D1 D2 O

Guarantee holds no matter what the


other records are.

Module 2 Tutorial: Differential Privacy in the Wild 66


Why all outputs?
P [ A(D1) = O1 ]
A(D1) = O1

D1

.
.
.

D2
P [ A(D2) = Ok ]

Set of all
Module 2 outputs
Tutorial: Differential Privacy in the Wild 67
Should not be able to distinguish whether input
was D1 or D2 no matter what the output

Worst discrepancy
in probabilities

D1

.
.
.

D2
Module 2 Tutorial: Differential Privacy in the Wild 68
Privacy Parameter ε
For every pair of inputs that
For every output …
differ in one row

D1 D2 O

Pr[A(D1) = O] ≤ eε Pr[A(D2) = O]

Controls the degree to which D1 and D2 can be distinguished.


Smaller ε gives more privacy (and worse utility)
Module 2 Tutorial: Differential Privacy in the Wild 69
Outline of the Module 2
• Differential Privacy

• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response

• Composition Theorems

Module 2 Tutorial: Differential Privacy in the Wild 70


Can deterministic algorithms satisfy differential privacy?

Module 2 Tutorial: Differential Privacy in the Wild 71


Non-trivial deterministic algorithms do not
satisfy differential privacy
Space of all inputs Space of all outputs

Module 2 Tutorial: Differential Privacy in the Wild 72


Non-trivial deterministic algorithms do not
satisfy differential privacy
Non-trivial: at least 2 outputs in image

Module 2 Tutorial: Differential Privacy in the Wild 73


There exist two inputs that differ in one entry
mapped to different outputs.

Pr = 1

Pr = 0

Module 2 Tutorial: Differential Privacy in the Wild 74


Random Sampling …
… also does not satisfy differential privacy
Input Output

D1 D2 O

Pr[D2 → O] = 0   implies   Pr[D1 → O] / Pr[D2 → O] = ∞

Module 2 Tutorial: Differential Privacy in the Wild 75


Output Randomization
Query

Database Add noise to


true answer
Researcher
• Add noise to answers such that:
– Each answer does not leak too much information about
the database.
– Noisy answers are close to the original answers.

Module 2 Tutorial: Differential Privacy in the Wild 76


[DMNS 06]

Laplace Mechanism
Query q

Database True answer


q(D) + η
q(D)

Researcher
Privacy depends on the λ parameter

h(η) ∝ exp(−|η| / λ)        Laplace Distribution – Lap(λ)

Mean: 0
Variance: 2λ²

[Figure: density of Lap(λ)]

Module 2 Tutorial: Differential Privacy in the Wild 77
How much noise for privacy?
[Dwork et al., TCC 2006]

Sensitivity: Consider a query q: I → R. S(q) is the smallest number such that for any neighboring tables D, D’,
| q(D) – q(D’) | ≤ S(q)

Thm: If the sensitivity of the query is S, then the Laplace mechanism with
λ = S/ε
guarantees ε-differential privacy. (See the sketch below.)
Module 2 Tutorial: Differential Privacy in the Wild 78
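A minimal sketch of the Laplace mechanism in Python (illustrative only: the function name and the toy count query below are ours, not from the tutorial):

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        # Calibrate the noise scale to lambda = S(q)/epsilon, as in the theorem above.
        rng = np.random.default_rng()
        return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Example: a COUNT query (sensitivity 1) over a toy disease column.
    disease = ["Y", "Y", "Y", "N", "N"]
    true_count = sum(1 for d in disease if d == "Y")          # = 3
    noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1)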
Sensitivity: COUNT query
D: Disease column = [Y, Y, Y, N, N]
• Number of people having disease (Disease = Y)
• Sensitivity = 1

• Solution: 3 + η,
where η is drawn from Lap(1/ε)
– Mean = 0
– Variance = 2/ε²

Module 2 Tutorial: Differential Privacy in the Wild 79


Sensitivity: SUM query
• Suppose all values x are in [a, b]

• Sensitivity = max(|a|, |b|)   (= b when 0 ≤ a ≤ b)

Module 2 Tutorial: Differential Privacy in the Wild 80


Privacy of Laplace Mechanism
• Consider neighboring databases D and D’, and some output O
• Ratio of output probabilities:
  Pr[ q(D) + η = O ] / Pr[ q(D’) + η = O ]
  = exp( ( |O − q(D’)| − |O − q(D)| ) / λ )
  ≤ exp( |q(D) − q(D’)| / λ )          (triangle inequality)
  ≤ exp( S(q) / λ ) = exp(ε)           (since λ = S(q)/ε)

Module 2 Tutorial: Differential Privacy in the Wild 81


Utility of Laplace Mechanism
• Laplace mechanism works for any function that
returns a real number

• Error: E[(true answer – noisy answer)²]

= Var( Lap(S(q)/ε) )

= 2·S(q)² / ε²

Module 2 Tutorial: Differential Privacy in the Wild 82


Exponential Mechanism
• For functions that do not return a real number …
– “what is the most common nationality in this room”:
Chinese/Indian/American…

• When perturbation leads to invalid outputs …


– To ensure integrality/non-negativity of output

Module 2 Tutorial: Differential Privacy in the Wild 83


[MT 07]
Exponential Mechanism
Consider some function f (can be deterministic or probabilistic):
Inputs Outputs

How to construct a differentially private version of f ?

Module 2 Tutorial: Differential Privacy in the Wild 84


Exponential Mechanism
• Scoring function w: Inputs × Outputs → R

• D: nationalities of a set of people


• #(D, O): # people with nationality O
• f(D): most frequent nationality in D
• w(D, O) = |#(D, O) - #(D, f(D))|

Module 2 Tutorial: Differential Privacy in the Wild 85


Exponential Mechanism
• Scoring function w: Inputs × Outputs → R

• Sensitivity of w:
  Δw = max |w(D, O) − w(D’, O)|, taken over all outputs O and all pairs D, D’ that differ in one tuple

Module 2 Tutorial: Differential Privacy in the Wild 86


Exponential Mechanism
Given an input D, and a scoring function w,

Randomly sample an output O from Outputs with


probability proportional to exp( −ε · w(D, O) / (2 · Δw) )
(a lower score w means O is closer to the true answer, so it receives higher probability)

• Note that for every output O, the probability that O is output is > 0.

Module 2 Tutorial: Differential Privacy in the Wild 87
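A small sketch of the exponential mechanism for the nationality example (illustrative; here the score is the count of the candidate output itself, with higher scores better — an equivalent reformulation of the distance-to-maximum score w above — and all names are ours):

    import numpy as np
    from collections import Counter

    def exponential_mechanism(data, outputs, score_fn, sensitivity, epsilon):
        # Sample an output with probability proportional to exp(eps * score / (2 * sensitivity)).
        scores = np.array([score_fn(data, o) for o in outputs], dtype=float)
        scores -= scores.max()                    # shift for numerical stability (cancels out)
        probs = np.exp(epsilon * scores / (2.0 * sensitivity))
        probs /= probs.sum()
        return np.random.choice(outputs, p=probs)

    # Example: "most common nationality in this room", scored by its count (sensitivity 1).
    room = ["Chinese", "Indian", "American", "Chinese", "Indian", "Chinese"]
    domain = ["Chinese", "Indian", "American"]
    count_score = lambda data, o: Counter(data)[o]
    winner = exponential_mechanism(room, domain, count_score, sensitivity=1, epsilon=0.5)

As on the slide, every candidate output is sampled with non-zero probability.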


[W 65]
Randomized Response (a.k.a. local randomization)
With probability p, report the true value; with probability 1−p, report the flipped value.

D: Disease (Y/N)     O: Reported (Y/N)
Y                    Y
Y                    N
N                    N
Y                    N
N                    Y
N                    N
Module 2 Tutorial: Differential Privacy in the Wild 88


Differential Privacy Analysis
• Consider 2 databases D, D’ (of size M) that
differ in the jth value
– D[j] ≠ D’[j]. But, D[i] = D’[i], for all i ≠ j

• Consider some output O

Module 2 Tutorial: Differential Privacy in the Wild 89


Utility Analysis
• Suppose n1 out of N people replied “yes”, and the rest said “no”
• What is the best estimate for π = fraction of people with disease = Y?
• Extract an estimate through post-processing:

π̂ = ( n1/N – (1−p) ) / (2p − 1)

• E(π̂) = π

• Var(π̂) = π(1−π)/N + p(1−p) / ( N(2p−1)² )

              Sampling        Variance due to coin flips
Module 2 Tutorial: Differential Privacy in the Wild 90
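A runnable sketch of randomized response and the post-processing estimator above (illustrative; function names and the simulated population are ours). For p > 1/2 the mechanism satisfies ε-differential privacy with ε = ln(p / (1−p)):

    import numpy as np

    def randomized_response(true_bits, p, rng=None):
        # Each individual reports the truth with probability p, the flipped value otherwise.
        rng = rng or np.random.default_rng()
        bits = np.asarray(true_bits)
        keep = rng.random(len(bits)) < p
        return np.where(keep, bits, 1 - bits)

    def estimate_fraction(reports, p):
        # Unbiased estimator from the slide: pi_hat = (n1/N - (1-p)) / (2p - 1).
        n1, N = np.sum(reports), len(reports)
        return (n1 / N - (1 - p)) / (2 * p - 1)

    # Example: 10,000 people, 30% truly have the disease.
    rng = np.random.default_rng(0)
    true_bits = (rng.random(10_000) < 0.3).astype(int)
    reports = randomized_response(true_bits, p=0.75, rng=rng)
    pi_hat = estimate_fraction(reports, p=0.75)   # close to 0.3 in expectation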


Laplace Mechanism vs Randomized Response

Privacy
• Provide the same ε-differential privacy guarantee

• The Laplace mechanism assumes the data collector is trusted


• Randomized Response does not require the data collector
to be trusted
– Also called a Local Algorithm, since each record is perturbed

Module 2 Tutorial: Differential Privacy in the Wild 91


Laplace Mechanism vs Randomized Response

Utility
• Suppose a database with N records where µN records
have disease = Y.
• Query: # rows with Disease=Y

• Std dev of Laplace mechanism answer: O(1/ε)


• Std dev of Randomized Response answer: O(√N)

Module 2 Tutorial: Differential Privacy in the Wild 92


Outline of the Module 2
• Differential Privacy

• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response

• Composition Theorems

Module 2 Tutorial: Differential Privacy in the Wild 93


Why Composition?
• Reasoning about privacy of
a complex algorithm is hard.

• Helps software design


– If building blocks are proven to be private, it would
be easy to reason about privacy of a complex
algorithm built entirely using these building blocks.

Module 2 Tutorial: Differential Privacy in the Wild 94


A bound on the number of queries
• In order to ensure utility, a statistical database
must leak some information about each
individual
• We can only hope to bound the
amount of disclosure

• Hence, there is a limit on number of


queries that can be answered

Module 2 Tutorial: Differential Privacy in the Wild 95


[DN03]
Dinur Nissim Result
• A vast majority of records in a database of size
n can be reconstructed when n·log²(n) queries
are answered by a statistical database …

… even if each answer has been arbitrarily


altered to have up to o(√n) error

Module 2 Tutorial: Differential Privacy in the Wild 96


Sequential Composition
• If M1, M2, ..., Mk are algorithms that access a private
database D such that each Mi satisfies εi -differential
privacy,

then running all k algorithms sequentially satisfies


ε-differential privacy with ε=ε1+...+εk

Module 2 Tutorial: Differential Privacy in the Wild 97
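A toy illustration of sequential composition as a running privacy-budget accountant (our own sketch; the class and method names are not from any particular DP library):

    class SequentialBudget:
        # Tracks privacy loss under sequential composition: epsilons of successive
        # analyses on the same data simply add up.
        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon):
            if self.spent + epsilon > self.total:
                raise RuntimeError("privacy budget exhausted")
            self.spent += epsilon

    budget = SequentialBudget(total_epsilon=1.0)
    budget.charge(0.25)   # e.g., one noisy count
    budget.charge(0.25)   # a second query on the same data; 0.5 of the budget remains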


Privacy as Constrained Optimization
• Three axes
– Privacy
– Error
– Queries that can be answered

• E.g.: Given a fixed set of queries and privacy


budget ε, what is the minimum error that can
be achieved?

Module 2 Tutorial: Differential Privacy in the Wild 98


Parallel Composition
• If M1, M2, ..., Mk are algorithms that access
disjoint databases D1, D2, …, Dk such that each
Mi satisfies εi -differential privacy,

then running all k algorithms in “parallel”


satisfies ε-differential privacy
with ε= max{ε1,...,εk}

Module 2 Tutorial: Differential Privacy in the Wild 99


Postprocessing
• If M1 is an ε-differentially private algorithm that
accesses a private database D,

then outputting M2(M1(D)) also satisfies ε-


differential privacy.

Module 2 Tutorial: Differential Privacy in the Wild 100


Case Study: K-means Clustering

Module 2 Tutorial: Differential Privacy in the Wild 101


Kmeans
• Partition a set of points x1, x2, …, xn into k
clusters S1, S2, …, Sk such that the following is
minimized:
Minimize:   Σ_{i=1..k}  Σ_{x_j ∈ S_i}  ‖ x_j − μ_i ‖²

where μ_i is the mean of cluster S_i

Module 2 Tutorial: Differential Privacy in the Wild 102


Kmeans
Algorithm:
• Initialize a set of k centers
• Repeat
Assign each point to its nearest center
Recompute the set of centers
Until convergence …

• Output final set of k centers


Module 2 Tutorial: Differential Privacy in the Wild 103
[BDMN 05]
Differentially Private Kmeans
• Suppose we fix the number of iterations to T

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster

3. Compute noisy sums of points in each cluster

Module 2 Tutorial: Differential Privacy in the Wild 104


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
Each iteration uses ε/T privacy budget, total privacy loss is ε

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster

3. Compute noisy sums of points in each cluster

Module 2 Tutorial: Differential Privacy in the Wild 105


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
Exercise: Which of these steps expends privacy budget?

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster

3. Compute noisy sums of points in each cluster

Module 2 Tutorial: Differential Privacy in the Wild 106


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
Exercise: Which of these steps expends privacy budget?

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters NO

2. Noisily compute the size of each cluster YES

3. Compute noisy sums of points in each cluster YES

Module 2 Tutorial: Differential Privacy in the Wild 107


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
What is the sensitivity?

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster — sensitivity: 1

3. Compute noisy sums of points in each cluster — sensitivity: domain size

Module 2 Tutorial: Differential Privacy in the Wild 108


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
Each iteration uses ε/T privacy budget, total privacy loss is ε

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster Laplace(2T/ε)

3. Compute noisy sums of points in each cluster


Laplace(2T |dom|/ε)

Module 2 Tutorial: Differential Privacy in the Wild 109
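A compact sketch of the differentially private k-means loop described above (our own illustrative code, assuming points lie in [−dom, dom]^d; clipping the noisy size at 1 is an implementation choice, not part of the slides):

    import numpy as np

    def private_kmeans(points, k, epsilon, T=10, dom=1.0, rng=None):
        rng = rng or np.random.default_rng()
        points = np.asarray(points, dtype=float)
        n, d = points.shape
        centers = points[rng.choice(n, size=k, replace=False)].copy()   # random initialization
        for _ in range(T):
            # 1. Assign points to nearest center (spends no privacy budget).
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                cluster = points[labels == j]
                # 2. Noisy cluster size: Laplace(2T/eps), count sensitivity 1.
                noisy_size = len(cluster) + rng.laplace(scale=2 * T / epsilon)
                # 3. Noisy per-coordinate sums: Laplace(2T*dom/eps), sum sensitivity dom.
                noisy_sum = cluster.sum(axis=0) + rng.laplace(scale=2 * T * dom / epsilon, size=d)
                centers[j] = noisy_sum / max(noisy_size, 1.0)
        return centers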


Results (T = 10 iterations, random initialization)
Original Kmeans algorithm Laplace Kmeans algorithm

• Even though we noisily compute centers, Laplace kmeans can distinguish


clusters that are far apart.

• Since we add noise to the sums with sensitivity proportional to |dom|,


Laplace k-means can’t distinguish small clusters that are close by.

Module 2 Tutorial: Differential Privacy in the Wild 110


Summary
• Differentially private algorithms ensure an
attacker can’t infer the presence or absence of a
single record in the input based on any output.

• Building blocks
– Laplace, exponential mechanism and randomized
response
• Composition rules help build complex
algorithms using building blocks

Module 2 Tutorial: Differential Privacy in the Wild 111


MODULE 3:
ANSWERING QUERIES ON
TABULAR DATA
Module 3 Tutorial: Differential Privacy in the Wild 112
Module 3: Answering queries on
Tabular data
• Answering query workloads on tabular databases

• Theory: two seminal results

• Survey of algorithm design ideas


– Low dimensional range queries
– Queries on high dimensional data

• Open Questions
Module 3 Tutorial: Differential Privacy in the Wild 113
Problem Formulation
• Input:
– Private database D consisting of a single table
(each tuple represents data of single individual)
– Workload W of counting queries* with arbitrary predicates

SELECT COUNT(*) FROM D WHERE <P>;

• Output: (noisy) answers to W


• Requirement: query answering algorithm satisfies
differential privacy

* Many techniques can also support linear queries:
SELECT SUM(f(t)) FROM D, where a user-defined f maps each tuple to [0,1]

Module 3
Tutorial: Differential Privacy in the Wild 114
Analysis of temporal & spatial patterns
Counting query:
SELECT COUNT(*)
FROM Tweets
WHERE
moodScale=k
AND t <= time
AND time < t+1
AND UScounty = C

Module 3 http://www.ccs.neu.edu/home/amislove/twittermood/
Tutorial: Differential Privacy in the Wild 115
Statistical agencies: data publishing
• U.S. Census Bureau publishes
statistics that can typically be
derived from marginals
• A marginal over attributes
𝐴1, … , 𝐴𝑘 reports count for
each combination of attribute
values.
– aka cube, contingency table
– E.g. 2-way marginal on
EmploymentStatus and Gender
• Thousands of marginals
released

https://factfinder.census.gov/
Module 3 Tutorial: Differential Privacy in the Wild 116
[WWLTRD09]

Genome Wide Association Studies


• Goal: to study genetic factors associated with a given disease
• Collect subsets of the population with diseases (case) and without
(control)
• Extract SNPs (specific DNA subsequences)
– For each SNP, usually find 2 alleles (alternative forms of the gene)

• Counting queries: Compute allele


frequencies in both the case and
control groups
(marginal over SNPxDisease)

• Perform association test using these frequencies (e.g., Chi Square Test)
to identify SNPs highly associated with disease

Module 3 Tutorial: Differential Privacy in the Wild 117


Problem variant: offline vs. online
• Offline (batch):
– Entire W given as input, answers computed in batch
• Online (adaptive):
– W is sequence q1, q2, … that arrives online
– Adaptive: analyst’s choice for qi can depend on
answers a1, …, a(i−1)
• Answering linear queries online is strictly harder
than answering them offline [BSU16].

Module 3 Tutorial: Differential Privacy in the Wild 118


Important aspects of problem:
Data and query complexity
• Data complexity
– Dimensionality: number of attributes
– Domain size: number of distinct attribute combinations
– Many techniques specialized for low dimensional data
• Query complexity
– Many techniques designed to work well for a specific
class of queries
– Classes (in rough order of difficulty): histograms, range
queries, marginals, counting queries, linear queries

Module 3 Tutorial: Differential Privacy in the Wild 119


Solution variants: query answers vs.
synthetic data
Two high-level approaches to solving problem
1. Direct:
– Output of the algorithm is list of query answers
2. Synthetic data:
– Algorithm constructs a synthetic dataset D’, which can
be queried directly by analyst
– Analyst can pose additional queries on D’
(though answers may not be accurate)

Module 3 Tutorial: Differential Privacy in the Wild 120


Theory
• Given negative result of Dinur-Nissim, is there any
hope?
• Yes!
– The key to Dinur-Nissim is that query answers have
independent noise (which can cancel out)
– To answer more queries, query error must be correlated
• Examples of correlation
– Use some query answers to approximate others
– Construct a synthetic database that is approximately
accurate for queries of interest

Module 3 Tutorial: Differential Privacy in the Wild 121


[BLR08]

Answering Exponentially Many Queries


Offline
• Key technical insight: For any set of count
queries 𝑊, there exists a small database 𝐷’
consistent with 𝐷 on every query in 𝑊.
– Small: O( log(|W|) / α² )
– Consistent: error for any q in W is at most ⍺
• Result follows from learning theory:
– Estimates on small random sample will generalize to
population

Module 3 Tutorial: Differential Privacy in the Wild 122


[BLR08]

Answering Exponentially Many Queries


Offline
• Input: 𝑊, 𝜀
• Output: 𝐷′
• The Mechanism:
– T = {all small databases D’}
– f(D, D’) = −max_{q ∈ W} |q(D) – q(D’)|
– Output D’ ∈ T using the Exponential
Mechanism applied to f
• Theorem: The mechanism is ε-private and w.h.p. the error ⍺ is at most
O( ( log|domain| · log|W| / (ε·|D|) )^(1/3) )
Module 3 Tutorial: Differential Privacy in the Wild 123
[BLR08]

Answering Exponentially Many Queries


Offline
• Input: 𝑊, 𝜀
• Output: 𝐷′
• The Mechanism:
– T = {all small databases D’}
– f(D, D’) = −max_{q ∈ W} |q(D) – q(D’)|
– Output D’ ∈ T using the Exponential Mechanism applied to f

• Theorem: The mechanism is ε-private and w.h.p. the error ⍺ is at most
O( ( log|domain| · log|W| / (ε·|D|) )^(1/3) )

Limitations
• Impractical: runtime exponential
• Offline (for online see [HR10])
Module 3 Tutorial: Differential Privacy in the Wild 124
Case study: range queries over spatial
data
Input: sensitive data D
Input: range query workload W (shown is a workload of 3 range queries)

BeijingTaxi dataset[1]:
4,268,780 records of (lat,lon)
pairs of taxi pickup locations
in Beijing, China in 1 month.

Scatter plot of input data

Task: compute answers to workload W over private input D

[1] Raw data from: Taxi trajectory open dataset, Tsinghua University, China. http://sensor.ee.tsinghua.edu.cn, 2009.

Module 3 Tutorial: Differential Privacy in the Wild 125
Baseline algorithm
1. Discretize attribute domain
into cells
2. Add noise to cell counts
(Laplace mechanism)
3. Use noisy counts to either…
   1. Answer queries directly (assume the distribution is uniform within each cell)
   2. Generate synthetic data (derive a distribution from the counts and sample)

Scatter plot of input data

Limitations
• Granularity of discretization
  – Coarse: detail lost
  – Fine: noise overwhelms signal
• Noise accumulates: squared error grows linearly with range
Module 3 Tutorial: Differential Privacy in the Wild 126
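A 1D sketch of this baseline (our own illustrative code; the fixed 128-cell domain mirrors the error-analysis example on the next slide):

    import numpy as np

    def noisy_histogram(values, bins, lo, hi, epsilon, rng=None):
        # Steps 1-2: discretize with data-independent bin edges, then add Laplace(1/eps)
        # noise to every cell (a histogram has sensitivity 1).
        rng = rng or np.random.default_rng()
        counts, edges = np.histogram(values, bins=bins, range=(lo, hi))
        return counts + rng.laplace(scale=1.0 / epsilon, size=len(counts)), edges

    def range_query(noisy_counts, i, j):
        # Step 3: answer a range query by summing the noisy cells it covers;
        # the variance (and hence squared error) grows linearly with the range length.
        return noisy_counts[i:j + 1].sum()

    data = np.random.default_rng(0).uniform(0, 128, size=5000)
    noisy, edges = noisy_histogram(data, bins=128, lo=0, hi=128, epsilon=0.1)
    answer = range_query(noisy, 0, 63)    # noisy count of values in the first 64 cells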
Error analysis
[Figure: per-query error of the baseline algorithm on every range query over a 128-value domain — low for small ranges such as [x1, x1], high for large ranges such as [x1, x128]]

• Dataset is 1D: single attribute, 128 possible values
• Baseline simply adds noise to every cell count
• Incurs high error on large ranges
Module 3 Tutorial: Differential Privacy in the Wild 127
[CPSSY12]

Improved algorithm: private quad-tree


• Input: maximum height 𝒉, minimum
leaf size 𝑳, data set D
• Recurse on nodes of tree:
– Add 𝐿𝑎𝑝(1/𝜀 ) noise to node count
– Split node domain into quadrants
– Create child nodes
• Stop when:
– Noisy count of node ≤ 𝑳
– Max height 𝒉 is reached
quadtree
• Intuition:
– Early stopping controls granularity of
discretization
– To answer long range queries, leverage
hierarchy of noisy counts

Module 3 Tutorial: Differential Privacy in the Wild 128
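A recursive sketch of the private quad-tree just described (our own illustrative code; the bounding-box handling and the dictionary node format are assumptions, not from [CPSSY12]). Nodes at the same level touch disjoint data, so each level spends ε and a tree of height h spends at most εh in total:

    import numpy as np

    def private_quadtree(points, box, epsilon, h, L, rng=None):
        rng = rng or np.random.default_rng()
        noisy_count = len(points) + rng.laplace(scale=1.0 / epsilon)   # Lap(1/eps) per node
        node = {"box": box, "count": noisy_count, "children": []}
        if h == 0 or noisy_count <= L:        # stopping rules: max height / small noisy count
            return node
        xmin, ymin, xmax, ymax = box
        xm, ym = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        quadrants = [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
                     (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]
        for q in quadrants:
            mask = ((points[:, 0] >= q[0]) & (points[:, 0] < q[2]) &
                    (points[:, 1] >= q[1]) & (points[:, 1] < q[3]))
            node["children"].append(private_quadtree(points[mask], q, epsilon, h - 1, L, rng))
        return node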


[CPSSY12]

Improved algorithm: private quad-tree


• Input: maximum height h, minimum leaf size L, data set D
• Recurse on nodes of tree:
– Add Lap(1/ε) noise to node count
– Split node domain into quadrants
– Create child nodes
• Stop when:
– Noisy count of node ≤ L
– Max height h is reached
• Intuition:
– Early stopping controls granularity of discretization
– To answer long range queries, leverage hierarchy of noisy counts

Exercise: Let h′ ≤ h be the height of the resulting tree. The algorithm satisfies ε’-differential privacy for ε’ equal to:
1. εh′
2. εh
3. 4εh
4. ε4h
Module 3 Tutorial: Differential Privacy in the Wild 129
Using tree structure for range queries
• Quad-tree, each node has noisy count. How use
to answer range queries?
• Idea 1: given range query q, find smallest set of
noisy counts from tree that “cover” q
• Idea 2 (better):
– Observation: node’s noisy count and sum of
children’s noisy counts are two estimates of same
quantity
– Combine estimates using statistical inference

Module 3 Tutorial: Differential Privacy in the Wild 130
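A minimal sketch of “Idea 2” for a single node (assuming both estimates are unbiased with known variances; the full inference of [HRMS10] additionally propagates the combined estimates back down the tree so that all answers are consistent):

    def combine_estimates(parent_noisy, children_noisy, var_parent, var_child):
        # The node's own noisy count and the sum of its children's noisy counts are two
        # unbiased estimates of the same quantity; combine them by inverse-variance weighting.
        children_sum = sum(children_noisy)
        var_children_sum = var_child * len(children_noisy)
        w = (1.0 / var_parent) / (1.0 / var_parent + 1.0 / var_children_sum)
        return w * parent_noisy + (1.0 - w) * children_sum

    # Example: a quad-tree node and its 4 children, all noised with Lap(b) (variance 2*b^2).
    b = 10.0
    estimate = combine_estimates(103.0, [20.0, 31.0, 28.0, 18.0],
                                 var_parent=2 * b * b, var_child=2 * b * b)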


Error comparison
[Figure: error of the tree algorithm on every range query vs. error of the baseline algorithm on every range query]
Module 3 Tutorial: Differential Privacy in the Wild 131
Error comparison
Hierarchy lowers error on large ranges but incurs slightly higher error for small ranges

[Figure: error of the tree algorithm on every range query vs. error of the baseline algorithm on every range query]
Module 3 Tutorial: Differential Privacy in the Wild 132
Credit: Cormode

Data transformations
• Can think of trees as transform of input

Original Data → Transform → Coefficients → Add Noise → Noisy Coefficients → “Invert” → Private Data

• Can apply other data transformations


• Goal: pick a low sensitivity transform that preserves
good properties of data
Module 3 Tutorial: Differential Privacy in the Wild 133
Linear transformations
• Examples
– Hierarchical: Trees [HRMS10,QYL13], full height quadtree
[CPSSY12]
– Haar Wavelet [XWG10]
– Discrete Fourier transform [BCDKMT07]
• Inverting transformation
– Some transformations (e.g. tree) have redundancy (over-
constrained), so require pseudo-inverse
• Matrix Mechanism [LHRMM10,LM12,LM13]
– Formalizes problem of designing a linear transform that is tailored
to the queries
• Error rates are independent of input (assumes linear transform
is “full rank”)

Module 3 Tutorial: Differential Privacy in the Wild 134


Lossy transformations
• Variants
– Drop “small” coefficients:
• Quad-tree with early stopping (noisy count threshold)
• Fourier coefficients: EFPA [ACC12], [RN10]
– Data-adaptive discretization:
• PrivTree [ZXX16], KD-Tree [CPSSY12], DAWA [LHMY14], [DNRR15], [QYL13], [BLR08]
– Data-adaptive measurement:
• MWEM [HLM12], DualQuery [GAHRW14]
– Randomized transforms: sketches and compressed sensing
• JL Transform [BBDS12], Compressive mechanism [LZWY11]
• “Inverting” transformation
– Because lossy, they are under-constrained, requires estimation
• Error rates depend on input
– Can be much lower (trades off small bias for lower variance)
– Warrants careful empirical evaluation; algorithms are “data dependent”

Module 3 Tutorial: Differential Privacy in the Wild 135


High dimensional data
• Generally an under-studied area
• Two algorithms, both synthetic data generators
– PrivBayes [ZCPSX14]
– DualQuery [GAHRW14]
• Common properties
– Limited to binary attributes
– Designed to support low-order marginals
(and other workloads well approximated by
marginals, such as classification)

Module 3 Tutorial: Differential Privacy in the Wild 136


[ZCPSX14]

PrivBayes

[Diagram: a high-dimensional table T (attributes ABCDEFG……) is decomposed into low-dimensional tables (e.g., ABC, CD, BE, DEF, …); noise is added to each; the noisy low-dimensional tables are used to reconstruct a noisy table T*]

• Method:
– Decompose the high-dimensional table into low-dimensional tables and add noise
– Use a Bayesian network to learn the data distribution
– After the BN is learned, generate synthetic data by sampling from the BN
• Challenge: privately choosing a good decomposition
Module 3 Tutorial: Differential Privacy in the Wild 137
[GAHRW14]

Dual Query
• Problem of generating synthetic data formulated as a zero sum game
between
– Data player: generates synthetic records to reduce query error
– Query player: chooses queries with high error (on current synthetic dataset)
– Theoretical analysis of utility comes from studying equilibrium of game

1. Let Qt be a distribution over queries (Q1 is uniform)
2. For t = 1 … T
   a) S ← Query player samples s queries from Qt    (Privacy: this step accesses the private data)
   b) Data player finds record xt that maximizes total answer on S    (NP-hard optimization)
   c) Qt+1 ← Query player updates(xt, D, Qt)
3. Output synthetic database x1, …, xt

• What makes it practical?
– Unlike some prior work [HLM12], avoids storing a distribution over the domain (exponential in size)
– An approximate solution to the NP-hard step may be good enough!
– The optimization problem can be solved with off-the-shelf solvers
• Case study on 500K 3-way marginals over 17K binary attributes, using the CPLEX solver

Module 3 Tutorial: Differential Privacy in the Wild 138


[HMMCZ16]

Empirical benchmarks
• [HMMCZ16] propose a novel evaluation framework for
standardized evaluation of privacy algorithms.
• Study of algorithms for range query answering over 1 and 2D
• Benchmark website www.dpcomp.org

Some data-
dependent
algorithms fail to
offer benefits at
larger scales (no.
of tuples)

Module 3 Tutorial: Differential Privacy in the Wild 139


Open questions
• Robust and private algorithm selection
– Pythia (Thursday 2 PM DP Session)
• Error bounds for data-dependent algorithms
• Empirical evaluation of algorithms for high
dimensional data

Module 3 Tutorial: Differential Privacy in the Wild 140


Differential Privacy References
CACM Articles
• [D11] Dwork, A firm foundation for private data analysis. In CACM, 2011.
• [MK15] Machanavajjhala & Kifer, Designing statistical privacy for your data. In CACM, 2015.
Book
• [DR14] Dwork & Roth, The Algorithmic Foundations of Differential Privacy. In Foundations
and Trends, 2014.
Tutorials
• [C13] Graham Cormode. Building blocks of privacy: Differentially private mechanisms. In
PrivDB, 2013.
• [YZMWX12] Yang et al., Differential privacy in data publication and analysis. In SIGMOD, 2012.
• [HLMPT11] Hay et al., Privacy-aware Data Management in Information Networks. In SIGMOD,
2011.
• [CS14]K. Chaudhuri and A. D. Sarwate. Differential privacy for signal processing and machine
learning. In WIFS, 2014.
• [KNRS13] S. P. Kasiviswanathan, K. Nissim, S. Raskhodnikova, and A. Smith. Analyzing graphs
with node differential privacy. In TCC, 2013.

Tutorial: Differential Privacy in the Wild 141


Module 1 References
• [S02] Sweeney, “K-anonymity: a model for protecting privacy”, IJUFKS 2002
• [DN03] Dinur, Nissim, “Revealing information while preserving privacy”,
PODS 2003
• [D06] Dwork, “Differential Privacy”, ICALP 2006
• [MKGV06] Machanavajjhala, Kifer, Gehrke, Venkitasubramaniam, “L-
Diversity” ICDE 2006
• [GKS08] Ganta, Kasiviswanathan, Smith, “Composition attacks and auxiliary
information in data privacy”, KDD 2008
• [KL10] Kifer, Lin, “Towards an Axiomatization of Statistical Privacy and
Utility.”, PODS 2010
• [VSJO13] Vaidya, Shafiq, Jiang, Ohno-Machado, “Identifying inference
attacks against healthcare data repositories”, AMIA 2013
• [MK15] Machanavajjhala, Kifer, “Designing statistical privacy for your data”,
CACM 2015
Module 1 Tutorial: Differential Privacy in the Wild 142
Module 2 References
• [W65] Warner, “Randomized Response” JASA 1965
• [DN03] Dinur, Nissim, “Revealing information while preserving privacy”,
PODS 2003
• [BDMN05] Blum, Dwork, McSherry, Nissim, “Practical privacy: the SuLQ
framework”, PODS 2005
• [D06] Dwork, “Differential Privacy”, ICALP 2006
• [DMNS06] Dwork, McSherry, Nissim, Smith, “Calibrating noise to sensitivity
in private data analysis”, TCC 2006
• [MT07] McSherry, Talwar, “Mechanism Design via Differential Privacy”,
FOCS 2007

Module 2 Tutorial: Differential Privacy in the Wild 143


Module 3 References
• [ACC12] Ács et al. Differentially private histogram publishing through lossy compression. In ICDM, 2012.
• [BBDS12] Blocki et al. The johnson-lindenstrauss transform itself preserves differential privacy. In FOCS, 2012.
• [BCDKMT07] Barak et al. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, 2007.
• [BLR08] Blum et al. A learning theory approach to noninteractive database privacy. In STOC, 2008.
• [DNRR15] Dwork et al. Pure Differential Privacy for Rectangle Queries via Private Partitions. In ASIACRYPT, 2015.
• [CPSSY12] Cormode et al. Differentially Private Spatial Decompositions. In ICDE, 2012.
• [GAHRW14] Gaboardi et al. Dual Query: Practical Private Query Release for High Dimensional Data. In ICML, 2014.
• [HLM12] Hardt et al. A simple and practical algorithm for differentially private data release. In NIPS, 2012.
• [HMMCZ16] Hay et al. Principled Evaluation of Differentially Private Algorithms using DPBench. In SIGMOD, 2016.
• [HRMS10] Hay et al. Boosting the accuracy of differentially private histograms through consistency. In PVLDB, 2010.
• [LHMY14] Li et al. A data- and workload-aware algorithm for range queries under differential privacy. In PVLDB, 2014.
• [LHRMM10] Li et al. Optimizing linear counting queries under differential privacy. In PODS, 2010.
• [LM12] Li et al. An adaptive mechanism for accurate query answering under differential privacy. In PVLDB, 2012.
• [LM13] Li et al. Optimal error of query sets under the differentially-private matrix mechanism. In ICDT, 2013.
• [LZWY11] Li et al. Compressive mechanism: utilizing sparse representation in differential privacy. In WPES, 2011.
• [QYL13] Qardaji et al. Understanding hierarchical methods for differentially private histograms. In PVLDB, 2013.
• [QYL13] Qardaji et al. Differentially private grids for geospatial data. In ICDE, 2013.
• [RN10] Rastogi et al. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD, 2010.
• [WWLTRD09] Wang et al. Privacy-preserving genomic computation through program specialization. In CCS, 2009.
• [XWG10] Xiao et al. Differential privacy via wavelet transforms. In ICDE, 2011.
• [ZCPSX14] Zhang et al. PrivBayes: private data release via bayesian networks. In SIGMOD, 2014.
• [ZXX16] Zhang et al. PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions. In SIGMOD, 2016.

Module 3 Tutorial: Differential Privacy in the Wild 144


Differential Privacy in the Wild
(Part 2)
A Tutorial on Current Practices and Open Challenges
Outline of the Tutorial
1. What is Privacy?
2. Differential Privacy
3. Answering Queries on Tabular Data
Break
4. Applications I: Machine Learning
5. Privacy in the Real World
6. Applications II: Networks and Trajectories

Tutorial: Differential Privacy in the Wild 2


MODULE 4:
APPLICATIONS I: MACHINE
LEARNING
Module 4 Tutorial: Differential Privacy in the Wild 3
Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Module 4 Tutorial: Differential Privacy in the Wild 4
Differentially Private Machine Learning

Predicts flu or not, based on patient symptoms


Trained on sensitive patient data Credit: Chaudhuri

Module 4 Tutorial: Differential Privacy in the Wild 5


From Attributes to Labeled Data

Sore Throat: Yes    Fever: No    Temperature: 99F    Flu?: No

Data: x = (1, 0, 99)        Label: y = −

Module 4 Tutorial: Differential Privacy in the Wild 6


Classifying Sensitive Data

[Figure: labeled points (+ / −) in feature space]

Private Data  →  Classification Algorithm  →  Public Classifier

Module 4 Tutorial: Differential Privacy in the Wild 7


Classifying Sensitive Data

[Figure: labeled examples (+ / −) drawn from a distribution P over labelled examples]

Goal: Find a vector 𝑤 that separates + from - for


most points from 𝑃

Key: Find a simple model to fit the samples

Module 4 Tutorial: Differential Privacy in the Wild 8


Empirical Risk Minimization
• Training dataset:
– Labeled data D = { (x_i, y_i) ∈ X × Y : i = 1, 2, …, n }
– e.g. binary classification: X = R^d, Y = {−1, +1}
– Train a predictor over D:  ω : X → Y

• Empirical risk (or error) of ω over D is

  (1/n) · Σ_{i=1..n} l(ω, (x_i, y_i))

– l is a loss function: how well ω classifies (x_i, y_i)

Module 4 Tutorial: Differential Privacy in the Wild 9


Examples of Loss Function
[Figure: labeled points (+ / −) in feature space]

Risk: Hinge loss l(z) = max(0, 1 − z)
Optimizer: Support vector machines (SVM)

Risk: Logistic loss l(z) = log(1 + exp(−z))


Optimizer: Logistic regression

Module 4 Tutorial: Differential Privacy in the Wild 10


Regularized ERM
• Goal: Labeled data 𝐷 = {(𝑥𝑖 , 𝑦𝑖 )}, find

f(D) = argmin_ω  (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i))

        Regularizer                  Risk
     (Model Complexity)        (Training Error)

Module 4 Tutorial: Differential Privacy in the Wild 11


Why is ERM not private for
Support Vector Machines (SVM)?

[Figure: SVM decision boundary determined by a few support vectors]

SVM solution is a combination of support vectors


If one support vector moves, solution changes

Module 4 Tutorial: Differential Privacy in the Wild 12


Why is ERM not private for
Support Vector Machines (SVM)?

[Figure: moving a single support vector changes the SVM decision boundary]

SVM solution is a combination of support vectors


If one support vector moves, solution changes

Module 4 Tutorial: Differential Privacy in the Wild 13


Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Module 4 Tutorial: Differential Privacy in the Wild 14
How to make ERM private?

[Figure: labeled points (+ / −)]    Pick ω from a distribution concentrated near the optimal solution

Module 4 Tutorial: Differential Privacy in the Wild 15


Output Perturbation
• Goal:
f̂(D) = f(D) + noise
     = [ argmin_ω  (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i)) ] + noise

Theorem [CMS11]: If ‖x_i‖ ≤ 1 and l is 1-Lipschitz,


then for any D, D′ with dist(D, D′) = 1,
‖ f(D) − f(D′) ‖₂  ≤  2/(λn)        (L₂-sensitivity)

Module 4 Tutorial: Differential Privacy in the Wild 16


Output Perturbation
• Goal:
𝑓V 𝐷 = 𝑓 𝐷 + 𝑛𝑜𝑖𝑠𝑒 =
@
1 U
1
Z𝑎𝑟𝑔𝑚𝑖𝑛R 𝜆 ∥ 𝜔 ∥ + ; 𝑙(𝜔, 𝑥> , 𝑦> )[ + 𝑛𝑜𝑖𝑠𝑒
2 𝑛
>AB

• Laplace 𝑛𝑜𝑖𝑠𝑒 drawn from


U
– Magnitude: drawn from Γ(d, )
d@g
– Direction: uniform at random

Module 4 Tutorial: Differential Privacy in the Wild 17
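A sketch of output perturbation for L2-regularized logistic regression along these lines (our own illustrative code using scikit-learn; it assumes ‖x_i‖ ≤ 1 and labels in {−1, +1}, and is not the reference implementation of [CMS11]):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def output_perturbation_lr(X, y, lam, epsilon, rng=None):
        rng = rng or np.random.default_rng()
        n, d = X.shape
        # Non-private regularized ERM; sklearn's C corresponds to 1/(lam * n).
        clf = LogisticRegression(C=1.0 / (lam * n), fit_intercept=False).fit(X, y)
        w = clf.coef_.ravel()
        # Noise with magnitude ~ Gamma(d, 2/(lam*n*eps)) and a uniformly random direction,
        # matching the calibration on the slide.
        magnitude = rng.gamma(shape=d, scale=2.0 / (lam * n * epsilon))
        direction = rng.normal(size=d)
        direction /= np.linalg.norm(direction)
        return w + magnitude * direction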


Property of Real Data

[Figure: loss surface around the ERM solution — steep in some directions]

The optimization surface is very steep in some directions
→ High loss if perturbed in those directions
Module 4 Tutorial: Differential Privacy in the Wild 18
Objective Perturbation
• Insight: Perturb optimization surface and then
optimize
f̂(D) = argmin_ω [ (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i)) + noise ]

• Main idea: add noise as part of the computation:


– Regularization already changes the objective to protects
against overfitting.
– Change the objective a little bit more to protect privacy.

Module 4 Tutorial: Differential Privacy in the Wild 19


Objective Perturbation
• Insight: Perturb optimization surface and then
optimize
f̂(D) = argmin_ω [ (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i)) + noise ]

• noise drawn as follows:


– Magnitude: drawn from Γ( d, 1/ε )
– Direction: uniform at random

Module 4 Tutorial: Differential Privacy in the Wild 20


Objective Perturbation
• Insight: Perturb optimization surface and then
optimize
f̂(D) = argmin_ω [ (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i)) + noise ]

• Theorem [CMS11]: If l is convex and doubly differentiable


with |l′(z)| ≤ 1 and |l″(z)| ≤ c, then the algorithm
satisfies ( ε + 2·log(1 + c/(nλ)) )-differential privacy.

Module 4 Tutorial: Differential Privacy in the Wild 21


Accuracy
• Number of samples needed for error α w.r.t. the best predictor
– Fewer samples implies higher accuracy

d: #dimensions,  γ: margin,  ε: privacy,  α: error,  with γ, α, ε < 1

• Normal SVM:               1 / (γα²)
• Objective perturbation:   1 / (γα²)  +  d / (γεα)
• Output perturbation:      1 / (γα²)  +  d / (γ^1.5 εα)

Module 4 Tutorial: Differential Privacy in the Wild 22


Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Module 4 Tutorial: Differential Privacy in the Wild 23
Stochastic Gradient Descent (SGD)
• Initialize ω₀
• Incremental gradient update, for t = 0 … T−1:
– Take a random example (x_t, y_t) ∈ D
– Update ω_{t+1} = ω_t − η_t · ∇l(ω_t, (x_t, y_t))
  • η_t is the step size

• Permutation-based SGD (PSGD)


– Randomly permute the training examples D = {(x_i, y_i)} to feed each pass of SGD
– Cycle through D k times: k-pass PSGD

Module 4 Tutorial: Differential Privacy in the Wild 24


White Box Approaches
• Initialize ω₀
• Incremental gradient update, for t = 0 … T−1:
– Take a random example (x_t, y_t) ∈ D
– Update ω_{t+1} = ω_t − η_t · ( ∇l(ω_t, (x_t, y_t)) + noise )
  • η_t is the step size

• Permutation-based SGD (PSGD)


– Randomly permute the training examples D = {(x_i, y_i)} to feed each pass of SGD
– Cycle through D k times: k-pass PSGD

Module 4 Tutorial: Differential Privacy in the Wild 25
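An illustrative sketch of the “white box” structure (permutation-based SGD with per-example gradient clipping and per-update noise). The function and parameter names are ours, and noise_scale is deliberately left as a parameter: choosing it to achieve a target (ε, δ) requires the composition analyses cited on the next slides ([BST14], [ACG16]):

    import numpy as np

    def noisy_psgd(X, y, grad_fn, dim, noise_scale, k=1, eta=0.1, clip=1.0, rng=None):
        rng = rng or np.random.default_rng()
        w = np.zeros(dim)
        for _ in range(k):                                    # k passes = k-pass PSGD
            for i in rng.permutation(len(X)):
                g = grad_fn(w, X[i], y[i])
                g = g / max(1.0, np.linalg.norm(g) / clip)    # clip gradient to L2 norm <= clip
                w = w - eta * (g + rng.normal(scale=noise_scale, size=dim))
        return w

    # Example gradient for the logistic loss l(w,(x,y)) = log(1 + exp(-y * w.x)), y in {-1,+1}:
    logistic_grad = lambda w, x, y: -y * x / (1.0 + np.exp(y * np.dot(w, x)))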


White Box Approaches
• Cycle 𝐷 for 𝑘 times
• Basic composition:
– Each pass is 𝜖-DP, then 𝑘-pass is 𝜖𝑘-DP.
– Privacy loss grows linearly with the number of
passes. [CSC13, SS15]
• Tighter privacy loss with advanced composition
– Convex objectives [JKT12, BST14]
– Deep learning with non-convex objectives [ACG16]

Module 4 Tutorial: Differential Privacy in the Wild 26


Advanced Composition
[DRV10]

• Composing k algorithms, each satisfying ε-DP,


ensures ε_g-DP with probability 1 − δ, where
ε_g = O( ε·√(k·ln(1/δ)) + k·ε² )

• Analyze the privacy loss as a random variable: given


output o and neighbors D, D′
PL(o) = ln( Pr[A(D) = o] / Pr[A(D′) = o] )

Module 4 Tutorial: Differential Privacy in the Wild 27


Advanced Composition
[DRV10]

• Composing k algorithms, each satisfying ε-DP,


ensures ε_g-DP with probability 1 − δ, where
ε_g = O( ε·√(k·ln(1/δ)) + k·ε² )

• Each algorithm has privacy loss PL(o)


– Worst case (DP): Pr[ |PL(o)| ≤ ε ] = 1
– Expected loss: E[ PL(o) ] ≤ ε(e^ε − 1)
– The total privacy loss ε_g is bounded by Azuma’s inequality

Module 4 Tutorial: Differential Privacy in the Wild 28
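Instantiating the bound above with the standard constants from [DRV10] (a sketch; the function name is ours):

    import math

    def advanced_composition(eps, k, delta):
        # eps_g = eps * sqrt(2k ln(1/delta)) + k * eps * (e^eps - 1)
        return eps * math.sqrt(2 * k * math.log(1 / delta)) + k * eps * (math.exp(eps) - 1)

    print(advanced_composition(eps=0.1, k=100, delta=1e-5))   # ~5.9, vs. 10.0 under basic composition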


Black Box Approaches
• Add noise to the final output of SGD [WLK17]
– No need code change to the SGD program
– Only sample noise once
– Allow 𝜖-DP and (𝜖, 𝛿)-DP
– Better convergence for a constant number of passes, based on a new bound on the L₂ sensitivity of k-pass PSGD

“Bolt-on DP” (Thursday 2 PM DP Session)

  Initialize ω₀
  For t = 0 … T−1:
      ω_{t+1} ← update(ω_t)
  ω_T ← ω_T + noise
  Output ω_T
Module 4 Tutorial: Differential Privacy in the Wild 29
L₂ sensitivity of k-pass PSGD
• If l is β-smooth and L-Lipschitz, the L₂ sensitivity is
– 2kLη if l is convex, with constant step size η_t = η ≤ 2/β
– 2L/(λn) if l is λ-strongly convex, with η_t = min(1/β, 1/(λt)), |D| = n
– Convergence when k = O(1)

Convergence comparison:
                   [WLK17] (black box)     [BST14] (white box)
Convex             O( √d / n )             O( √d · log²n / n )
Strongly convex    O( d·log n / n )        O( d·log²n / n )
Module 4 Tutorial: Differential Privacy in the Wild 30


Other Fitting Techniques
• Mini-batching SGD
– At step 𝑡, the gradient is updated with a batch of
examples 𝐵x from 𝐷
– Add noise per iteration
• ω_{t+1} = ω_t − η_t · ( E_{(x_i, y_i) ∈ B_t}[ ∇l(ω_t, (x_i, y_i)) ] + noise )
– Or add noise to the final output

• Proximal algorithm for strongly convex


optimization [JKT12]
– Add noise per iteration
– Harder to implement than SGD
Module 4 Tutorial: Differential Privacy in the Wild 31
Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Module 4 Tutorial: Differential Privacy in the Wild 32
Other Important Problems
• Practical issues
– Parameter tuning: exponential mechanism [CMS11]
– High dimensional data: random projection [WLK17]

• Solve non-convex optimization


– Deep learning [SS15, ACG16]

• Understand what can be learned privately [KLNR11]


– Private learning w/o efficiency: PAC, SQ
– What cannot be learned privately? e.g. threshold functions
where hypothesis space is infinite

Module 4 Tutorial: Differential Privacy in the Wild 33


DP Algorithms for ML
• Private ERM – a specific learning approach
  – Output perturbation:    argmin(objective) + noise
  – Objective perturbation: argmin(objective + noise)

• Private SGD – a fitting technique
  – White box approaches:
      Initialize 𝜔₀
      For 𝑡 = 0 … 𝑇 − 1
          𝜔_{t+1} ← update(𝜔_t)
          𝜔_{t+1} ← 𝜔_{t+1} + noise
      Output 𝜔_T
  – Black box approaches:
      Initialize 𝜔₀
      For 𝑡 = 0 … 𝑇 − 1
          𝜔_{t+1} ← update(𝜔_t)
      Output 𝜔_T
      𝜔_T ← 𝜔_T + noise
Module 4 Tutorial: Differential Privacy in the Wild 34
MODULE 5:
PRIVACY IN THE REAL WORLD

Module 5 Tutorial: Differential Privacy in the Wild 35


Module 5: Privacy in the real world
• Real world deployments of differential privacy

– OnTheMap RAPPOR

• Privacy beyond Tabular Data


– No Free Lunch Theorem
– Customizing differential privacy using Pufferfish

Module 5 Tutorial: Differential Privacy in the Wild 36


http://onthemap.ces.census.gov/

Module 5 Tutorial: Differential Privacy in the Wild 37


Data underlying OnTheMap
• Employee • Job
– Age – Start date
– Sex – End date
– Race & Ethnicity – Worker & Workplace IDs
– Education – Earnings
– Home location (Census block)

• Employer
– Geography (Census blocks)
– Industry
– Ownership (Public vs Private)
Module 5 Tutorial: Differential Privacy in the Wild 38
Why release such data?

Module 5 Tutorial: Differential Privacy in the Wild 39


Why is privacy needed?

US Code: Title 13 CENSUS


It is against the law to make any publication whereby
the data furnished by any particular establishment or
individual under this title can be identified.

Violating the statutory confidentiality pledge can result


in fines of up to $250,000 and potential imprisonment
for up to five years.

Module 5 Tutorial: Differential Privacy in the Wild 40


OnTheMap
Residence = Origin (Sensitive)
Workplace = Destination (Quasi-identifier)
Origins and destinations are Census blocks.

Worker ID   Origin     Destination
1223        MD11511    DC22122
1332        MD2123     DC22122
1432        VA11211    DC22122
2345        PA12121    DC24132
1432        PA11122    DC24132
1665        MD1121     DC24132
1244        DC22122    DC22122

Module 5 Tutorial: Differential Privacy in the Wild 41


Current approach: Synthetic Database
• Sanitize the dataset one time
• Analyst can perform arbitrary computations on
the synthetic datasets

• Unlike in query answering systems


– No need to maintain state (of queries asked)
– No need to track privacy loss across queries or
across analysts

Module 5 Tutorial: Differential Privacy in the Wild 42


Synthetic Residence Generator
(circa 2007)
Origin     Destination   # Workers
MD1151     DC22122        1
MD2123     DC22122        3
VA11211    DC22122       12        + noise
PA12121    DC24132       43        (Dirichlet resampling)
PA11122    DC24132        5
MD1121     DC24132        2
DC22122    DC22122        1

No noise is added to origin-destination pairs with true count 0


Can lead to re-identification attacks.
Module 5 Tutorial: Differential Privacy in the Wild 43
Differentially Private Synthetic Data
Generator
• Noise added to all origin-destination (o-d) pairs
– Even if 0 count in the original dataset

• Noise calibrated to ensure a variant called


probabilistic differential privacy

• Utility ensured by coarsening the domain and


probabilistically dropping o-d pairs with no
support.
Module 5 Tutorial: Differential Privacy in the Wild 44
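As a rough illustration of the idea (not the exact OnTheMap algorithm), the sketch below shows Dirichlet resampling for a single destination block: a prior pseudo-count alpha is added to every candidate origin, including those with true count 0 (as the differentially private variant requires), a probability vector is drawn from the resulting Dirichlet, and synthetic origins are sampled from it. The function name and the calibration of alpha are assumptions made for illustration.

    import numpy as np

    def synthesize_origins(origin_counts, alpha, n_synthetic):
        """Dirichlet-resampling sketch for one destination block.
        origin_counts: true worker counts for every candidate origin block
        alpha: prior pseudo-count added to every origin (including zeros)"""
        counts = np.asarray(origin_counts, dtype=float)
        probs = np.random.dirichlet(counts + alpha)        # noisy o-d distribution
        return np.random.multinomial(n_synthetic, probs)   # synthetic origin counts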
Evaluation
• Utility measured by average commute distance for each
destination block.

Experimental Setup:
• OTM: Currently published
OnTheMap data used as original data.
• All destinations in Minnesota.
• 120,690 origins per destination,
  chosen by pruning out blocks that are more
  than 100 miles from the destination.

• Total ε = 8.3, δ = 10⁻⁵

Module 5 Tutorial: Differential Privacy in the Wild 45


Module 5: Privacy in the real world
• Real world deployments of differential privacy

– OnTheMap RAPPOR

• Privacy beyond Tabular Data


– No Free Lunch Theorem
– Customizing differential privacy using Pufferfish

Module 5 Tutorial: Differential Privacy in the Wild 46


A dilemma
• Cloud services want to protect their users,
clients and the service itself from abuse.

• Need to monitor statistics of, for instance,


browser configurations.
– Did a large number of users have their home page
redirected to a malicious page in the last few hours?

• But users do not want to give up their data


Module 5 Tutorial: Differential Privacy in the Wild 47
Browser configurations can identify users

Module 5 Tutorial: Differential Privacy in the Wild 48


Problem [Erlingsson et al CCS’14]

What are the frequent unexpected


Chrome homepage domains?

→ To learn about malicious software
that changes Chrome settings
without users’ consent
.
.
.

Finance.com WeirdStuff.com
Fashion.com

Module 5 Tutorial: Differential Privacy in the Wild 49


Why is privacy needed?
Liability (for server)
Storing unperturbed sensitive
data makes server
accountable (breaches,
subpoenas, privacy policy
violations)
.
.
.

Finance.com
WeirdStuff.com
Fashion.com

Module 5 Tutorial: Differential Privacy in the Wild 50


Solution
Can use Randomized Response …

On a binary domain:
With probability p report true value
With probability 1-p report false value

… but the domain of all urls is very large …


… original value is reported with very low prob.

Module 5 Tutorial: Differential Privacy in the Wild 51
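A minimal sketch of this binary randomized response, together with the corresponding unbiased frequency estimate (assuming p > 1/2; function names are illustrative):

    import numpy as np

    def randomized_response(bit, p):
        # report the true bit with probability p, the flipped bit otherwise
        return bit if np.random.rand() < p else 1 - bit

    def estimate_frequency(reports, p):
        # E[observed] = f*p + (1-f)*(1-p)  =>  f = (observed - (1-p)) / (2p - 1)
        observed = np.mean(reports)
        return (observed - (1 - p)) / (2 * p - 1)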


RAPPOR Solution
• Idea 1: Use Bloom filters to reduce the domain size

[Figure 1: Life of a RAPPOR report, for participant 8456 in cohort 1 whose true value
is Finance.com (illustrated with the string “The number 68”). The client value is hashed
onto a 256-bit Bloom filter B using h (here 4) hash functions, setting 4 signal bits.
A Permanent randomized response B′ (69 bits on) is produced and memoized by the client,
and this B′ is used (and reused in the future) to generate the Instantaneous randomized
responses sent to the server (145 bits on).]

Module 5 Tutorial: Differential Privacy in the Wild 52
RAPPOR Solution
• Idea 2: Use randomized response on the Bloom filter bits

[Same RAPPOR report figure as on the previous slide: the Bloom filter B (4 signal bits)
is turned into the memoized Permanent randomized response B′ (69 bits on).]

Module 5 Tutorial: Differential Privacy in the Wild 53
RAPPOR Solution
• Idea 3: Use randomized response again, on the fake Bloom filter

Why randomize two times?
– Chrome collects information each day
– Want perturbed values to look different on
  different days to avoid linking

[Same RAPPOR report figure as before: each day the memoized Permanent randomized
response B′ is used to generate a fresh Instantaneous randomized response (145 bits on),
which is the report actually sent to the server.]

Module 5 Tutorial: Differential Privacy in the Wild 54
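Taken together, the three ideas give a simple client-side encoder. Below is a minimal Python sketch of that pipeline; the 256-bit filter, the salted SHA-256 hashing, and the parameter values f = 0.5, p = 0.5, q = 0.75 are illustrative stand-ins, not the exact RAPPOR implementation.

    import hashlib
    import numpy as np

    def bloom_bits(value, num_bits=256, num_hashes=4, cohort=1):
        # Idea 1: hash the string onto a small Bloom filter
        bits = np.zeros(num_bits, dtype=int)
        for h in range(num_hashes):
            digest = hashlib.sha256(f"{cohort}:{h}:{value}".encode()).hexdigest()
            bits[int(digest, 16) % num_bits] = 1
        return bits

    def permanent_rr(bits, f=0.5):
        # Idea 2: memoized randomized response on every Bloom-filter bit
        r = np.random.rand(len(bits))
        return np.where(r < f / 2, 1, np.where(r < f, 0, bits))

    def instantaneous_rr(fake_bits, p=0.5, q=0.75):
        # Idea 3: a fresh randomization each day, so daily reports cannot be linked
        probs = np.where(fake_bits == 1, q, p)
        return (np.random.rand(len(fake_bits)) < probs).astype(int)

    # one daily report for one client:
    # report = instantaneous_rr(permanent_rr(bloom_bits("Finance.com")))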
Server Report Decoding
• Step 5: estimate the bit frequencies f̂(𝐷) from the clients’ reports
• Step 6: estimate the frequencies of candidate strings
  (Finance.com, Fashion.com, WeirdStuff.com, …) by regression from f̂(𝐷)

[Figure: the server stacks the clients’ randomized bit reports, sums each
column to obtain noisy bit counts, and maps them back to candidate strings.]
Module 5 Tutorial: Differential Privacy in the Wild 55
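For Step 5, the two layers of randomized response can be inverted in closed form. The sketch below assumes the same illustrative parameters (f, p, q) as the client sketch above; Step 6, mapping the estimated bit frequencies back to candidate strings, is a separate regression (lasso) problem and is not shown.

    import numpy as np

    def estimate_bit_frequencies(reports, f=0.5, p=0.5, q=0.75):
        """Step 5 sketch: invert the two randomized-response layers to get an
        unbiased estimate of how often each Bloom-filter bit is truly set.
        reports: (num_clients x num_bits) 0/1 array of daily reports."""
        q_star = q * (1 - f / 2) + p * (f / 2)   # P(report bit = 1 | true bit = 1)
        p_star = q * (f / 2) + p * (1 - f / 2)   # P(report bit = 1 | true bit = 0)
        observed = reports.mean(axis=0)
        return (observed - p_star) / (q_star - p_star)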
Evaluation
http://google.github.io/rappor/examples/report.html

Module 5 Tutorial: Differential Privacy in the Wild 56


Other Real World Deployments
• Differentially private password frequency lists [Blocki et al. NDSS ‘16]
– released a corpus of 50 password frequency lists representing
approximately 70 million Yahoo! users
– ε varies from 8 to 0.002
• Human Mobility [Mir et al. Big Data ’13 ]
– synthetic data to estimate commute patterns from call detail records
collected by AT&T
– 1 billion records ~ 250,000 phones
• Apple will use DP [Greenberg. Wired Magazine ’16]
– in iOS 10 to collect data to improve QuickType and emoji suggestions,
Spotlight deep link suggestions, and Lookup Hints in Notes
– in macOS Sierra to improve autocorrect suggestions and Lookup Hints

Module 5 Tutorial: Differential Privacy in the Wild 57


Module 5: Privacy in the real world
• Real world deployments of differential privacy

– OnTheMap RAPPOR

• Privacy beyond Tabular Data


– No Free Lunch Theorem
– Customizing differential privacy using Pufferfish

Module 5 Tutorial: Differential Privacy in the Wild 58


Differential Privacy & Complex Datatypes

• Defining neighboring databases


– What is a record?

• Records can be correlated


– Unravels privacy guarantee

Module 5 Tutorial: Differential Privacy in the Wild 59


Graphs
Neighboring databases … differ in one record.

• In graphs, a record can be:


– An edge (u,v)
– The adjacency list of node u

Module 5 Tutorial: Differential Privacy in the Wild 60


Trajectories
Neighboring databases … differ in one record.

• In location trajectories, a record can be:


– Each location in the trajectory
– A sequence of locations spanning a window of time
– The entire trajectory

Module 5 Tutorial: Differential Privacy in the Wild 61


US Census Bureau Data
• Employee • Job
– Age – Start date
– Sex – End date
– Race & Ethnicity – Worker & Workplace IDs
– Education – Earnings
– Home location (Census block)

• Employer
– Geography (Census blocks)
– Industry
– Ownership (Public vs Private)
Module 5 Tutorial: Differential Privacy in the Wild 62
US Census Bureau Data
Neighboring databases … differ in one record.
[Screenshots: existing LODES data as presented in the OnTheMap tool —
employment in Lower Manhattan and residences of Lower Manhattan workers.
Available at http://onthemap.ces.census.gov/]

• A record can be:
  – An employee
  – An employer
  – Something else?
• Come to the talk on Thursday:
  Samuel Haney, “The Cost of Provable Privacy: A Case Study on …”

Module 5 Tutorial: Differential Privacy in the Wild 63


Differential Privacy & Complex Datatypes

• Defining neighboring databases


– What is a record?

• Records can be correlated


– Unravels privacy guarantee

Module 5 Tutorial: Differential Privacy in the Wild 64


[KM11]

Correlations and DP

Bob Alice

• Want to release the number of edges between blue


and green communities.
• Should not disclose the presence/absence of Bob-
Alice edge.

Module 5 Tutorial: Differential Privacy in the Wild 65


[KM11]
Adversary knows how social networks evolve

Depending on the social network evolution model,
(d2 − d1) is linear or even super-linear in the size of the network,
where d1 and d2 are the cross-community edge counts in the two worlds
(Bob–Alice edge absent vs. present), once the correlated edges implied
by the evolution model are taken into account.

Module 5 Tutorial: Differential Privacy in the Wild 66


[KM11]

Differential privacy fails to avoid breach

Output (d1 + δ)

δ ~ Laplace(1/ε)

Output (d2 + δ)

Adversary can distinguish between the two


worlds if d2 – d1 is large.

Module 5 Tutorial: Differential Privacy in the Wild 67


[KM11]

Reason for Privacy Breach


• Pairs of tables that differ
in one tuple
• The adversary cannot distinguish between them

Tables that do not


satisfy background
knowledge

Space of all
possible tables

Module 5 Tutorial: Differential Privacy in the Wild 68


[KM11]

Reason for Privacy Breach


The adversary can distinguish between
every pair of these tables based
on the output

Space of all
possible tables

Module 5 Tutorial: Differential Privacy in the Wild 69


No Free Lunch Theorem
It is not possible to guarantee any utility in addition
to privacy, without making assumptions about

• the data generating distribution [KM11]

• the background knowledge available
  to an adversary [DN10]

Module 5 Tutorial: Differential Privacy in the Wild 70


Need a formal theory to understand the
privacy ensured by DP

Module 5 Tutorial: Differential Privacy in the Wild 71


[KM12, KM14]

Pufferfish

Module 5 Tutorial: Differential Privacy in the Wild 72


Pufferfish Semantics
• What is being kept secret?

• Who are the adversaries?

• How is information disclosure bounded?


– (similar to epsilon in differential privacy)

Module 5 Tutorial: Differential Privacy in the Wild 73


Sensitive Information
• Secrets: S be a set of potentially sensitive statements
– “individual j’s record is in the data, and j has Cancer”
– “individual j’s record is not in the data”

• Discriminative Pairs:
Mutually exclusive pairs of secrets.
– (“Bob is in the table”, “Bob is not in the table”)
– (“Bob has cancer”, “Bob has diabetes”)

– Denotes an adversary’s possible beliefs about a target individual.

Module 5 Tutorial: Differential Privacy in the Wild 74


Adversaries
• We assume a Bayesian adversary who can be completely
characterized by his/her prior information about the data
– We do not assume computational limits

• Data Evolution Scenarios: set of all probability distributions


that could have generated the data ( … think adversary’s prior).

– No assumptions: All probability distributions over data instances are


possible.

– I.I.D.: Set of all f such that: P(data = {r1, r2, …, rk}) = f(r1) x f(r2)
x…x f(rk)

Module 5 Tutorial: Differential Privacy in the Wild 75


Information Disclosure
• Mechanism M satisfies ε-Pufferfish(S, Spairs, D) if, for every output o,
  every discriminative pair (s, s′) ∈ Spairs, and every data-evolution
  scenario θ ∈ D under which both s and s′ have non-zero probability:

  P( M(Data) = o | s, θ )  ≤  e^ε · P( M(Data) = o | s′, θ )

  (and the symmetric inequality with s and s′ swapped)

Module 5 Tutorial: Differential Privacy in the Wild 76


Pufferfish Semantic Guarantee

P(s | M(Data) = o, θ) / P(s′ | M(Data) = o, θ)   ≤   e^ε · P(s | θ) / P(s′ | θ)

       Posterior odds of s vs s′                        Prior odds of s vs s′

Module 5 Tutorial: Differential Privacy in the Wild 77


Customizing Privacy
• Setup secrets and discriminative pairs based on
the requirements of what must be kept secret

• Set up data generating distributions to capture


correlations known to the adversary

• Pufferfish results in privacy definition that


bounds the adversary’s posterior and prior odds
for every discriminative pair.
Module 5 Tutorial: Differential Privacy in the Wild 78
Advantages
• Privacy defined more generally in terms of
customizable secrets rather than records
– Better capture legal privacy policies

• Can better explore privacy-utility tradeoff by


varying secrets and adversaries
– See application to US Census Bureau Data
(Thursday 2PM DP Session)

• Gives a deeper understanding of the protections


afforded by existing privacy definition
Module 5 Tutorial: Differential Privacy in the Wild 79
Pufferfish & Differential Privacy

• Discriminative Pairs:
  – sᵢ,ₓ : record i takes the value x
  – sᵢ,⊥ : record i is not in the database
  – Spairs = { (sᵢ,ₓ , sᵢ,⊥) : for all individuals i, for all values x in the domain }

• Attackers should not be able to tell whether a


record is in or out of the database

Module 5 Tutorial: Differential Privacy in the Wild 80


Pufferfish & Differential Privacy

• Data evolution:
– For all θ = [ f1, f2, f3, …, fk ]

  P( Data = D | θ ) = ∏_{rᵢ ∈ D} fᵢ(rᵢ)

• Adversary’s prior may be any distribution that


makes records independent
Module 5 Tutorial: Differential Privacy in the Wild 81
Pufferfish & Differential Privacy
• Discriminative Pairs:
– Spairs = { (sᵢ,ₓ , sᵢ,⊥) : for all individuals i, for all values x in the domain }

• Data evolution:
– For all θ = [ f1, f2, f3, …, fk ]

  P( Data = D | θ ) = ∏_{rᵢ ∈ D} fᵢ(rᵢ)
A mechanism M satisfies differential privacy
if and only if
it satisfies Pufferfish instantiated using Spairs and {θ}

Module 5 Tutorial: Differential Privacy in the Wild 82


Challenges with Pufferfish
• Setting up data generating distributions are tricky
– Adversary’s knowledge is unknown

• Little work on algorithm design for Pufferfish


– Notable Exceptions: Blowfish (next module), and
Wasserstein mechanism (Thursday 2 PM DP Session)

• Not all Pufferfish definitions are “good”


– Many do not satisfy composition

Module 5 Tutorial: Differential Privacy in the Wild 83


Summary
• Complex datatypes require custom privacy definitions
– No Free Lunch theorem
– Varied notions of neighboring databases
– Correlations can unravel privacy ensured by DP algorithms

• Pufferfish is a mathematical framework for defining


privacy
– A rigorous way to customize privacy to applications
– Helps understand semantics of privacy definitions

Module 5 Tutorial: Differential Privacy in the Wild 84


MODULE 6:
APPLICATIONS II: NETWORK &
TRAJECTORIES
Module 6 Tutorial: Differential Privacy in the Wild 85
Module 6: Applications II
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectories

• Blowfish Privacy

Module 6 Tutorial: Differential Privacy in the Wild 86


Pufferfish Semantics
• What is being kept secret?

• Who are the adversaries?

• How is information disclosure bounded?


– (similar to epsilon in differential privacy)

Module 6 Tutorial: Differential Privacy in the Wild 87


Examples: Graphs
• Neighboring graphs differ in presence/absence of one edge

• Pufferfish meaning:
– Data: matrix of bits
– Secrets: whether or not an edge (𝑢, 𝑣) is in the graph -- bit at
(𝑢, 𝑣) is 0 or 1
– Data generating distributions: All graphs where each edge 𝒆 is
independently present with probability 𝑝𝑒.

• But …
– Edges are not independent in real graphs

Module 6 Tutorial: Differential Privacy in the Wild 88


Examples: Location Trajectories
• Neighboring tables differ in one location (at one point of
time) of an individual

• Pufferfish meaning
– Data: a matrix of locations
– Secrets: Whether or not individual was at some location at some
point of time
– Data Generating Distributions: All trajectories where an
individual’s location at some time is independent of all other
locations …

• But …
– Current location depends on previous locations…

Module 6 Tutorial: Differential Privacy in the Wild 89


Common Themes
• What are secrets and neighboring datasets for different
applications?

• Correlations between protected objects requires further


redefinition of privacy
• New privacy definitions requires new algorithm design
• Many open questions

Module 6 Tutorial: Differential Privacy in the Wild 90


Social Network
• Represented using a graph 𝐺(𝑉, 𝐸)
– 𝑉: node set (individuals)
– 𝐸: edge set (social links)

Module 6 Tutorial: Differential Privacy in the Wild 91


Social Network
• Attacks on graph anonymization
“it is possible for an adversary to learn whether edges
exist or not between specific targeted pairs of nodes.”
[BDK07]

“a third of the users on both Twitter and Flickr, can


be re-identified in the anonymous Twitter graph
with only a 12% error rate.” [NS09]

Module 6 Tutorial: Differential Privacy in the Wild 92


Private Analysis of Social Network
Trusted Server

𝑓𝑝𝑟𝑖𝑣𝑎𝑡𝑒(𝐺, 𝜖)

Person 1 Person 2 Person 3 Person N


v1 v2 v3 vN

Differential Privacy: f_private is ε-differentially private if,
for all neighboring graphs G, G′ and every output set S:
Pr[ f_private(G, ε) ∈ S ] ≤ e^ε · Pr[ f_private(G′, ε) ∈ S ]

Module 6 Tutorial: Differential Privacy in the Wild 93


Variants of DP for Social Network
• Edge Differential Privacy
Secret: social links between individuals

Two graphs are neighbors if they differ in the


presence of one edge
Module 6 Tutorial: Differential Privacy in the Wild 94
Variants of DP for Social Network
• Node Differential Privacy
Secret: presence of an individual

Two graphs are neighbors if one can be obtained by


another by adding or removing a node and all its edges
Module 6 Tutorial: Differential Privacy in the Wild 95
Examples for Social Network Statistics
• Degree distribution 𝐷(𝐺)
• Number of edges
• Counts of small subgraphs
e.g triangles, 𝑘-triangles, 𝑘-stars, etc.
• Cut
• Distance to nearest graph with a certain
property
• Joint degree distribution
Module 6 Tutorial: Differential Privacy in the Wild 96
Examples for Social Network Statistics
• Degree distribution 𝐷(𝐺)

Degree 0 1 2 3 4 5
Frequency 0 5 0 0 0 1

𝐷(𝐺) = [0,5,0,0,0,1]

Module 6 Tutorial: Differential Privacy in the Wild 97


Global Sensitivity of Degree
Distribution
• What is the global sensitivity of the degree
distribution of 𝐺 𝑉, 𝐸 , where 𝑉 = 𝑛 under
Edge differential privacy?

Module 6 Tutorial: Differential Privacy in the Wild 98


Global Sensitivity of Degree
Distribution
• What is the global sensitivity of the degree
distribution of 𝐺 𝑉, 𝐸 , where 𝑉 = 𝑛 under
Edge differential privacy?

Answer: 4
Remove an edge (𝑖, 𝑗), where node 𝑖 has degree 𝑑ᵢ and node 𝑗 has degree 𝑑ⱼ;
the degree frequencies change in four cells, each by 1:
Degree:            …   𝑑ᵢ−1   𝑑ᵢ   …   𝑑ⱼ−1   𝑑ⱼ   …
Frequency change:  …    +1    −1   …    +1    −1   …

Module 6 Tutorial: Differential Privacy in the Wild 99
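Since the global sensitivity of the degree histogram under edge DP is 4, the Laplace mechanism applies directly. A minimal sketch (assuming degrees are given as non-negative integers; the function name is illustrative):

    import numpy as np

    def private_degree_distribution(degrees, max_degree, epsilon):
        # adding/removing one edge changes two node degrees, hence at most
        # four histogram cells by 1 each: L1 sensitivity 4 under edge DP
        hist = np.bincount(degrees, minlength=max_degree + 1).astype(float)
        return hist + np.random.laplace(scale=4.0 / epsilon, size=hist.shape)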


Global Sensitivity of Degree
Distribution (Exercise)
• What is the global sensitivity of the degree
distribution of 𝐺 𝑉, 𝐸 , where 𝑉 = 𝑛 under
Node differential privacy?

Module 6 Tutorial: Differential Privacy in the Wild 100


Global Sensitivity of Degree
Distribution (Exercise)
• What is the global sensitivity of the degree
distribution of 𝐺 𝑉, 𝐸 , where 𝑉 = 𝑛 under
Node differential privacy? Answer: 2n-1
Highly sensitive! → Too much noise

𝐷(𝐺) = [0,5,0,0,0,1] 𝐷(𝐺′) = [5,0,0,0,0,0]

Module 6 Tutorial: Differential Privacy in the Wild 101


Approach to Highly Sensitive Queries
Key idea:
• Project 𝐺 onto a 𝜃-degree-bounded graph 𝐺_𝜃
• Answer queries on 𝐺_𝜃 instead of 𝐺:
  D̃(𝐺) = 𝐷(𝐺_𝜃) + noise

• Existing approaches for degree distribution


– Node-based truncation [KNRS13]
– Lipschitz extensions [RS15]
– Edge-based projection [DLL16]

Module 6 Tutorial: Differential Privacy in the Wild 102


How much noise?
• Answer queries on 𝐺_𝜃 instead of 𝐺:
  D̃(𝐺) = 𝐷(𝐺_𝜃) + noise
• Sensitivity
  – Node-based truncation: 2𝜃 ⋅ 𝛿
    (smooth sensitivity approach [NRS07], also applicable to counts of
    edges and small subgraphs)
  – Lipschitz extensions: 6𝜃 [KNRS13]
  – Edge-based projection: 2𝜃 + 1 [DLL16]

Module 6 Tutorial: Differential Privacy in the Wild 103
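A minimal sketch of the projection idea, using a naive truncation of each adjacency list to θ neighbours as a stand-in for the more careful projections cited above; the caller must supply the sensitivity that matches the projection actually used, and the function name is illustrative.

    import numpy as np

    def theta_bounded_degree_hist(adj_list, theta, epsilon, sensitivity):
        # naive projection: cap every node's degree at theta, then release a
        # noisy degree histogram; `sensitivity` must match the chosen projection
        degrees = [min(len(neigh), theta) for neigh in adj_list]
        hist = np.bincount(degrees, minlength=theta + 1).astype(float)
        return hist + np.random.laplace(scale=sensitivity / epsilon, size=hist.shape)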


Work on Edge DP
• Degree distribution
– Global sensitivity + Post-processing [HLM09, HRMS10,
KS12, LK13]
• Small subgraph counting
– Smooth sensitivity [NRS07]
– Ladder function [ZCPSX15]
– Noisy sensitivity [DL09]
• Cut
– Random projections, global sensitivity [BBDS12]
– Iterative updates [HR10, GRU12]
• Releasing differentially private graph
– Exponential random graphs [LM14, KSK15]

Module 6 Tutorial: Differential Privacy in the Wild 104


Outline of Module 6
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectory

• Blowfish Privacy

Module 6 Tutorial: Differential Privacy in the Wild 105


Location Trajectory

High uniqueness & High predictability


[MHVB13] [SQBB10]

‘show me how you move and I will tell you who you are’
[GKC10]

‘geosocial service “check in” dropped from 18% to 12%’


in the Pew Research Center’s Internet Project, 2013

Module 6 Tutorial: Differential Privacy in the Wild 106


Rich Domain for Secrets
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 107


Rich Domain for Secrets
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 108


Rich Domain for Secrets
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 109


Rich Domain for Secrets
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 110


Overview of Privacy Definitions
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 111


Protect a Single Location
• Protect a single location
  e.g. Location-based Services (LBS) to find a restaurant
• Do not reveal the exact location
• Revealing an approximate location is ok

[Diagram: true location 𝑥, nearby location 𝑥′, within radius 𝑟]

• A mechanism satisfies 𝝐-geo-indistinguishability iff for all
  observations 𝑆 ⊆ 𝑍, for all 𝑟 > 0, and for all neighbors 𝑥, 𝑥′
  with 𝑑(𝑥, 𝑥′) ≤ 𝑟:   [ABCP13]
  Pr[ 𝑆 | 𝑥 ] ≤ 𝑒^(𝜖𝑟) · Pr[ 𝑆 | 𝑥′ ]

Module 6 Tutorial: Differential Privacy in the Wild 112
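One common mechanism for ε-geo-indistinguishability perturbs the true point with planar Laplace noise, whose density decays as exp(−ε · distance); in polar coordinates the angle is uniform and the radius is Gamma(2, 1/ε)-distributed. A minimal sketch, assuming planar coordinates and ignoring the discretization and truncation steps used in practice:

    import numpy as np

    def planar_laplace(location, epsilon):
        # 2D noise with density proportional to exp(-epsilon * distance)
        x, y = location
        angle = np.random.uniform(0.0, 2.0 * np.pi)
        radius = np.random.gamma(shape=2.0, scale=1.0 / epsilon)
        return x + radius * np.cos(angle), y + radius * np.sin(angle)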


Different Levels of Protection
Total budget: 𝜖𝑡
• Event level DP

Privacy Budget 𝜖 𝜖 𝜖 …. 𝜖 𝜖

Released locations z1 z2 z3 …. zt-1 zt


True locations x1 x2 x3 …. xt-1 xt
1 2 3 …. t-1 t time

If a person stays at the same location for a long time, 𝑥₁ = 𝑥₂ =
⋯ = 𝑥_w, then averaging the released locations (𝑧₁, …, 𝑧_w) leaks the true location.

Module 6 Tutorial: Differential Privacy in the Wild 113


Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– Neighboring stream prefixes (𝑥₁, 𝑥₂, …, 𝑥_t) and (𝑥′₁, 𝑥′₂, …, 𝑥′_t):
  • each 𝑥ᵢ and 𝑥′ᵢ are the same or neighboring
  • for any 𝑖 < 𝑗, if 𝑥ᵢ ≠ 𝑥′ᵢ and 𝑥ⱼ ≠ 𝑥′ⱼ, then 𝑗 − 𝑖 + 1 ≤ 𝑤
→ Protect all updates happening within any window of 𝑤 events with privacy
budget 𝜖

x1 x2 xi …. xj …. xt
x'1 x’2 x’i …. x’j …. x’t
≤𝑤 time
Module 6 Tutorial: Differential Privacy in the Wild 114
Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– E.g. 𝑤=3
𝜖

Released locations z1 z2 z3 z4 …. zt-1 zt


True locations x1 x2 x3 x4 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 115


Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– E.g. 𝑤=3
𝜖

Released locations z1 z2 z3 z4 …. zt-1 zt


True locations x1 x2 x3 x4 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 116


Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– Allows a budget allocation strategy:
  • Adaptively assign per-event privacy budgets 𝜖₁, 𝜖₂, 𝜖₃, 𝜖₄, … to the
    events within the same 𝑤-window, so that they sum to at most 𝜖
  • E.g. 𝑤 = 3
Released locations z1 z2 z3 z4 …. zt-1 zt
True locations x1 x2 x3 x4 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 117


Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– Allows a budget allocation strategy:
  • Adaptively assign privacy budgets to events within the same 𝑤-window
  • E.g. 𝑤 = 3: the per-event budgets assigned inside the current
    window sum to 5𝜖/6 ≤ 𝜖
Released locations z1 z2 z3 z4 …. zt-1 zt
True locations x1 x2 x3 x4 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 118
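A small helper that checks whether a proposed per-timestep budget sequence respects the w-event constraint; this is illustrative only and not an algorithm from [KPXP14].

    def respects_w_event_budget(budgets, w, total_eps):
        # every window of w consecutive per-timestep budgets must sum to <= total_eps
        return all(sum(budgets[i:i + w]) <= total_eps
                   for i in range(len(budgets) - w + 1))

    # e.g. respects_w_event_budget([1/3, 1/4, 1/4, 1/4], w=3, total_eps=1.0)  ->  True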


Different Levels of Protection
• Trajectory-level DP for entire trajectory
– Neighboring databases 𝐷1, 𝐷2
• Differ in one entire trajectory
– Release aggregate statistics for multiple trajectories
[CAC12, HCMPS15]
Privacy Budget 𝜖

Released locations z1 z2 z3 …. zt-1 zt


True locations x1 x2 x3 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 119


Different Level of Privacy Protection
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectory

• Edge DP • 𝜖-indistinguishability
• Node DP • Event DP
• 𝑤-event DP
• Trajectory level DP

Module 6 Tutorial: Differential Privacy in the Wild 120


Outline of Module 6
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectory

• Blowfish Privacy

Module 6 Tutorial: Differential Privacy in the Wild 121


Blowfish Privacy [HMD14]

• Special case of Pufferfish that satisfies sequential


composition

• A framework for redefining neighboring databases for


complex datatypes using a policy graph
– Captures many neighboring definitions
– Handles correlations induced by constraints on database
• Prior data releases
• Location constraints

Module 6 Tutorial: Differential Privacy in the Wild 122


Blowfish
• Differential Privacy:
For all outputs 𝑜, for all 𝐷₁, 𝐷₂ with |𝐷₁ − 𝐷₂| = 1:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^𝜖 · Pr[ 𝐴(𝐷₂) = 𝑜 ]

• Blowfish Privacy (redefined neighbor relation 𝑵_𝑮):
For all outputs 𝑜, for all (𝐷₁, 𝐷₂) ∈ 𝑵_𝑮:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^𝜖 · Pr[ 𝐴(𝐷₂) = 𝑜 ]

Module 6 Tutorial: Differential Privacy in the Wild 123


Algorithm Design Simplified
[HMD16]

• Transformational equivalence between Blowfish and


differential privacy

• No need to do algorithm design from scratch for each


definition

• Answering queries under a Blowfish privacy policy is


equivalent in error to answering transformed queries
under differential privacy

Module 6 Tutorial: Differential Privacy in the Wild 124


Intuition
For all outputs 𝑜, for all 𝐷₁, 𝐷₂ with |𝐷₁ − 𝐷₂| = 1:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^𝜖 · Pr[ 𝐴(𝐷₂) = 𝑜 ]

is equivalent to

For all outputs 𝑜, for all 𝐷₁, 𝐷₂:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^(𝜖·Δ(𝐷₁,𝐷₂)) · Pr[ 𝐴(𝐷₂) = 𝑜 ]
where Δ(𝐷₁, 𝐷₂) is the size of the symmetric difference

Module 6 Tutorial: Differential Privacy in the Wild 125


Definitions differ in distance metrics
• Differential Privacy:
For all outputs 𝑜, for all 𝐷₁, 𝐷₂:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^(𝜖·Δ(𝐷₁,𝐷₂)) · Pr[ 𝐴(𝐷₂) = 𝑜 ]

• Blowfish Privacy (distance metric 𝑑_𝐺 imposed by the neighbor relation):
For all outputs 𝑜, for all 𝐷₁, 𝐷₂:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^(𝜖·𝑑_𝐺(𝐷₁,𝐷₂)) · Pr[ 𝐴(𝐷₂) = 𝑜 ]

Module 6 Tutorial: Differential Privacy in the Wild 126


Transformational equivalence …
… achieved by embedding the distance metric imposed by the Blowfish
neighbor definition into the distance metric imposed by neighbors
that differ in one record.

Module 6 Tutorial: Differential Privacy in the Wild 127


Extending Differential Privacy via
Metrics
• [CEBP13] propose generalizations of differential
privacy using metrics
– Special case of Pufferfish and generalizes Blowfish

• [WSC17] use a similar intuition to derive a generalized


sensitivity notion for using Laplace mechanism for
Pufferfish Thur 2pm DP Session
– Based on Wasserstein distances
– Computing this generalized sensitivity can be intractable
– Examples of intractability also shown in [KM11, HMD14]

Module 6 Tutorial: Differential Privacy in the Wild 128


Module 6: Applications II
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectories

• Blowfish Privacy

Module 6 Tutorial: Differential Privacy in the Wild 129


Open Questions
• Identify realistic policies for real world applications.
– Is it socially acceptable to offer weaker privacy protection to
high-degree nodes?
• Algorithm design under complicated constraints or
correlations.
– Correlations within both trajectories and between users, e.g.
family members may share similar trajectories patterns.
– Highly sensitive queries under constraints or correlations.
• Privacy analysis across different guarantees.

Module 6 Tutorial: Differential Privacy in the Wild 130


MODULE 7:
SUMMARY

Tutorial: Differential Privacy in the Wild 131


Module 7: Summary
• Recap of tutorial
• Five Cross-cutting ideas
• Challenges

Tutorial: Differential Privacy in the Wild 132


Statistical database privacy
• Statistical database privacy is the problem of
releasing aggregates while not disclosing individual
records
• Privacy desiderata
– Resilience to background knowledge
– Composition
– Avoid privacy by obscurity: public
algorithms/implementations
• Utility desiderata
– Accurate
– Useful

Tutorial: Differential Privacy in the Wild 133


Tutorial Summary
• Applications
– Query answering
– Machine learning
– Analysis of network data
– Trajectories
• Real-world deployments:
– U.S. Census Bureau OnTheMap: commuting patterns
– Google RAPPOR: browser settings
• Formal privacy definitions
– Differential privacy, Pufferfish, Blowfish

Tutorial: Differential Privacy in the Wild 134


Cross-cutting ideas
1. Higher accuracy through careful composition
– Parallel composition, advanced composition

2. Where to inject noise?


– On input, output, intermediate result
– Find “information bottleneck” that has tight bound
on sensitivity
– May be dictated by application (e.g., RAPPOR)

Tutorial: Differential Privacy in the Wild 135


Cross-cutting ideas
3. Lossy transformations
– Histograms: adaptive binning
– RAPPOR: bloom filters
– Social networks: degree-bounded graphs
– … results in bias-variance tradeoffs

4. Leverage domain knowledge


– OnTheMap: previously published data
– RAPPOR: heavy hitters
– Network data: tends to be sparse
– Histograms: smooth, sparse

Tutorial: Differential Privacy in the Wild 136


Cross-cutting ideas
5. Privacy definition may be application specific
– Differential privacy is a rigorous definition that
protects individual tuples…
– … but this may not align with semantics of
application
– In your application…
• What are the secrets?
• Who are the adversaries? What data correlations can
they exploit?

Tutorial: Differential Privacy in the Wild 137


Challenge 1: From Prototypes to
Deployments
• Community needs more examples of real-world
deployments

[Erlingsson et al, CCS 2014]

• Demonstrate usefulness in real applications


• These raise important research problems
– Hardening against side-channel attacks [M12]
– Matching formal privacy guarantee to needs of
application
Tutorial: Differential Privacy in the Wild 138
Challenge 2: From Algorithms to
Systems
• Today, getting DP to work in practice requires a
team of experts

• … resembles early days of database research…

Tutorial: Differential Privacy in the Wild 139


… without exception ad hoc,
cumbersome, and difficult to
use – they could really only be
used by people having highly
specialized technical skills …

E. F. Codd on the state of


databases in early 1970s

Tutorial: Differential Privacy in the Wild 140


Challenge 2: From Algorithms to
Systems
• Today, getting DP to work in practice requires a team of
experts

• Example of systems work: Privacy Integrated Queries


(PINQ) [M10]
– Guarantees that programs satisfy privacy…
– … but program author responsible for accuracy

• Need more research on systems…


– Modular components
– Automatic optimization

Tutorial: Differential Privacy in the Wild 141


Challenge 3: Communicating privacy-
utility tradeoffs
• Inherent tradeoff between utility and privacy
  [screenshot of a SIGMOD 2016 paper]
• Must be communicated to stakeholders
• Need for tunable algorithms

Tutorial: Differential Privacy in the Wild 142


Thank you!
Ashwin Machanavajjhala
Assistant Professor, Duke University

“What does privacy mean … mathematically?”

Michael Hay
Assistant Professor, Colgate University

“Can algorithms be provably private and useful?”

Xi He
Ph.D. Candidate, Duke University

“Can privacy algorithms work in real world systems?”


Tutorial: Differential Privacy in the Wild 143
Module 4 References
• [KLNR11] Kasiviswanathan, Lee, Nissim, Raskhodnikova, “What Can We Learn
Privately?”, SIAM Journal of Computing, 2011
• [CMS11] Chaudhuri, Monteleoni, Sarwate, “Differentially Private Empirical Risk
Minimization”, MLRJ 2011
• [CSC13] Song, Chaudhuri, Sarwate, “Stochastic gradient descent with differentially
private updates”, GlobalSIP 2013
• [BST14] Bassily, Smith, Thakurta, “Private Empirical Risk Minimization, Revisited”,
FOCS 2014
• [SS15] Shokri, Shmatikov, “Privacy-Preserving Deep Learning”. CCS 2015
• [JKT12] Jain, Kothari, Thakurta , “Differentially Private Online Learning”, COLT,
2012
• [ACG16] Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, Zhang, “Deep
Learning with Differential Privacy”, SIGSAC 2016
• [WLK17] Wu, Li, Kumar, Chaudhuri, Jha, Naughton, “Bolt-on Differential Privacy for
Scalable Stochastic Gradient Descent-based Analytics”, SIGMOD 2017
• [DRV10] Dwork, Rothblum, Vadhan. Boosting and differential privacy, FOCS 2010

Module 4 Tutorial: Differential Privacy in the Wild 144


Module 5 References
• [DN10] Dwork, Naor “On the Difficulties of Disclosure Protection in
Statistical Databases or the Case for Differential Privacy”, JPC 2010
• [KM11] Kifer, Machanavajjhala, “No Free Lunch in Data Privacy”, SIGMOD
2011
• [KM12] Kifer, Machanavajjhala, “A rigorous and customization framework
for privacy”, PODS 2012
• [KM14] Kifer, Machanavajjhala, “Pufferfish: A framework for mathematical
privacy definitions”, TODS 2014
• [HMD14] He, Machanavajjhala, Ding, “Blowfish Privacy” SIGMOD 2014
• [HMD16] Haney, Machanavajjhala, Ding, “Design of Policy Aware
Differentially Private Algorithms”, VLDB 2016
• [CEBP13] Chatzikokolakis, Andres, Bordenabe, Palamidessi, “Broadening the
scope of differential privacy using metrics”, PoPETS 2013
• [WSC16] Wang, Song, Chaudhuri, “Privacy Preserving analysis of Correlated
Data”, Corr abs/1603.03977, 2016

Module 5 Tutorial: Differential Privacy in the Wild 145


Module 6 References
• [BDK07] Backstrom, Dwork, Kleinberg, “Wherefore Art Thou R3579X? Anonymized Social Networks,
Hidden Patterns, and Structural Steganography”, WWW 2007
• [NS09] Narayanan, Shmatikov, “De-anonymizing Social Networks.” IEEE Security & Privacy 2009
• [KNRS13] Kasiviswanathan, Nissim, Raskhodnikova, Smith, “Analyzing Graphs with Node Differential
Privacy”, TCC 2013
• [NRS07] Nissim, Raskhodnikova, Smith. “Smooth Sensitivity and Sampling in Private Data Analysis”, STOC
2007
• [RS15] Raskhodnikova, Smith, “Efficient Lipschitz Extensions for High-Dimensional Graph Statistics and
Node Private Degree Distributions”, arXiv:1504.07912v1, 2015
• [DLL16] Day, Li, Lyu. “Publishing Graph Degree Distribution with Node Differential Privacy”, SIGMOD
2016
• [KRSY09] Karwa, Raskhodnikova, Smith, Yaroslavtsev, “Private Analysis of Graph Structure”, VLDB 2011
• [DL09] Dwork, Lei. “Differential privacy and robust statistics”, STOC 2009
• [HLMJ09] Hay, Li, Miklau, Jensen, “Accurate estimation of the degree distribution of private networks”,
ICDM 2009.
• [HRMS10] Hay, Rastogi, Miklau, Suciu. “Boosting the accuracy of differentially-private queries through
consistency”, PVLDB 2010.

Module 6 Tutorial: Differential Privacy in the Wild 146


Module 6 References
• [KS12] Karwa, Slavkovic. “Differentially Private Graphical Degree Sequences and Synthetic Graphs”, Privacy
in Statistical Databases 2012: 273-285
• [LK13] Lin, Kifer. “Information preservation in statistical privacy and bayesian estimation of unattributed
histograms”. SIGMOD 2013
• [GRU12] Gupta, Roth, Ullman, “Iterative Constructions and Private Data Release”, TCC 2012
• [ZCPSX15] Zhang, Cormode, Procopiuc, Srivastava and Xiao, “Private Release of Graph Statistics using
Ladder Functions”, SIGMOD 2015
• [KSK15] Karwa, Slavkovi´c, Krivitsky. “Differentially Private Exponential Random Graphs”,
arXiv:1409.4696v2 2015
• [LM14] Lu. Miklau, “Exponential random graph estimation under differential privacy”, KDD 2014
• [KKS15] Karwa, Krivitsky, Slavković, “Sharing Social Network Data: Differentially Private Estimation of
Exponential-Family Random Graph Models”, arXiv:1511.02930v1. 2015
• [BBDS12] Blocki, Blum, Datta, Sheffet, “The Johnson-Lindenstrauss Transform Itself Preserves Differential
Privacy”, FOCS 2012
• [HR10] Hardt, Rothblum, “ A Multiplicative Weights Mechanism for Privacy-Preserving Data
Analysis”, FOCS 2010
• [MHVB13] Montjoye, Hidalgo, Verleysen, Blondel, “ Unique in the crowd: The privacy bounds of human
mobility”. Sci. Rep., 3(1376), 2013.
• [SQBB10] Song, Qu, Blumm, Barabsi. “ Limits of predictability in human mobility”, Science, 2010

Module 6 Tutorial: Differential Privacy in the Wild 147


Module 6 References
• [GKC10] Gambs, Killijian, Cortez, “Show Me How You Move and I Will Tell You Who You Are”, SPRINGL
2010
• [ABCP13] Andrés, Bordenabe, Chatzikokolakis, Palamidessi, “Geo-Indistinguishability: Differential Privacy for
Location-Based Systems”, CCS 2013
• [XX15] Xiao, Xiong. “ Protecting Locations with Differential Privacy under Temporal Correlations”. CCS 2015
• [KPXP14] Kellaris, Papadopoulos, Xiao, Papadias, “Differentially Private Event Sequences over Infinite Streams”,
VLDB 2014
• [HCMPS15] He, Cormode, Machanavajjhala, Procopiuc, Srivastava, “DPT: Differentially private trajectory
synthesis using hierarchical reference systems”, VLDB 2015
• [CAC12] Chen, Acs, Castelluccia, “Differentially private sequential data publication via variable-length n-
grams”, CCS 2012.
• [HMD14] He, Machanavajjhala, Ding, “Blowfish Privacy” SIGMOD 2014
• [HMD16] Haney, Machanavajjhala, Ding, “Design of Policy Aware Differentially Private Algorithms”, VLDB
2016
• [CEBP13] Chatzikokolakis, Andres, Bordenabe, Palamidessi, “Broadening the scope of differential privacy
using metrics”, PoPETS 2013
• [WSC17] Wang, Song, Chaudhuri, “Privacy Preserving analysis of Correlated Data”, SIGMOD, 2017
• [CYXX17] Yang Cao, Masatoshi Yoshikawa, Yonghui Xiao, Li Xiong, “Quantifying Differential Privacy under
Temporal Correlations”, ICDE 2017
• [KM11] Kifer, Machanavajjhala, “No Free Lunch in Data Privacy”, SIGMOD 2011
Module 6 Tutorial: Differential Privacy in the Wild 148
