
Differential Privacy in the Wild

A Tutorial on Current Practices and Open Challenges


About the Presenters
Ashwin Machanavajjhala
Assistant Professor, Duke University

“What does privacy mean … mathematically?”

Michael Hay
Assistant Professor, Colgate University

“Can algorithms be provably private and useful?”

Xi He
Ph.D. Candidate, Duke University

“Can privacy algorithms work in real world systems?”


Tutorial: Differential Privacy in the Wild 2
Our world is increasingly data driven

Source (http://www.agencypja.com/site/assets/files/1826/marketingdata-1.jpg)

Tutorial: Differential Privacy in the Wild 3


Aggregated Personal Data is invaluable
Advertising

Source (esri.com)

Genome Wide
Association Studies

Human Mobility
analysis
Tutorial: Differential Privacy in the Wild 4
Personal data is … well … personal!
Personal attributes: Age, Income, Address, Likes/Dislikes, Sexual Orientation, Medical History

Potential harms: Redlining, Discrimination, Physical/Financial Harm

Source (time.com)

Tutorial: Differential Privacy in the Wild 5


Aggregated Personal Data …
… is made publicly available in many forms.

De-identified records (e.g., medical)    Statistics (e.g., demographic)    Predictive models (e.g., advertising)

Tutorial: Differential Privacy in the Wild 6


That’s fine … I am anonymous!

Source (http://xkcd.org/834/)

Tutorial: Differential Privacy in the Wild 7


Anonymity is not enough …

Tutorial: Differential Privacy in the Wild 8


… and predictive models can breach
privacy too

Tutorial: Differential Privacy in the Wild 9


Need data analysis algorithms that can
mine aggregated personal data with
provable guarantees of privacy for
individuals.

This is the goal of Differential Privacy.

Tutorial: Differential Privacy in the Wild 10


Outline of the Tutorial
1. What is Privacy?
2. Differential Privacy
3. Answering Queries on Tabular Data
Break
4. Applications I: Machine Learning
5. Privacy in the Real World
6. Applications II: Networks and Trajectories

Tutorial: Differential Privacy in the Wild 11


Module 1: What is Privacy?
• Privacy Problem Statement

• What privacy is not …


– Encryption
– Anonymization
– Restricted Query Answering

• What is privacy?
Tutorial: Differential Privacy in the Wild 12
Module 2: Differential Privacy
• Differential Privacy Definition

• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response

• Composition Theorems

Tutorial: Differential Privacy in the Wild 13


Module 3: Answering queries on
Tabular data
• Answering query workloads on tabular databases

• Theory: two seminal results

• Survey of algorithm design ideas


– Low dimensional range queries
– Queries on high dimensional data

• Open Questions
Tutorial: Differential Privacy in the Wild 14
Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Tutorial: Differential Privacy in the Wild 15
Module 5: Privacy in the Real World
• Real world deployments of differential privacy

– OnTheMap RAPPOR

• Privacy beyond Tabular Data


– No Free Lunch Theorem
– Customizing differential privacy using Pufferfish

Tutorial: Differential Privacy in the Wild 16


Module 6: Applications II
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectories

• Blowfish Privacy

Tutorial: Differential Privacy in the Wild 17


Scope of the Tutorial
What we do not cover:
• Securing data using encryption
• Computation on encrypted data
• Computationally bounded DP
• De-anonymization
• Anonymization schemes (k-anonymity, l-diversity, etc.)
• Access control

Tutorial: Differential Privacy in the Wild 18


MODULE 1:
PROBLEM FORMULATION

Module 1 Tutorial: Differential Privacy in the Wild 19


Module 1: What is Privacy?
• Privacy Problem Statement

• What privacy is not …


– Encryption
– Anonymization
– Restricted Query Answering

• What is privacy?
Tutorial: Differential Privacy in the Wild 20
Statistical Databases
Individuals with sensitive data: Person 1 (r1), Person 2 (r2), Person 3 (r3), …, Person N (rN)

Data Collectors: Hospital, Census, Google (each maintains a DB)

Data Analysts: Economists, Doctors, Medical Researchers, Recommendation Algorithms, Information Retrieval Researchers
Module 1 Tutorial: Differential Privacy in the Wild 21
[S 02]
The Massachusetts Governor
Privacy Breach
Medical Data Release
• Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
• Zip, Birth date, Sex

Module 1 Tutorial: Differential Privacy in the Wild 22


[S 02]
The Massachusetts Governor
Privacy Breach
Medical Data Release: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Shared attributes: Zip, Birth date, Sex

Module 1 Tutorial: Differential Privacy in the Wild 23


[S 02]

Linkage Attack
Medical Data Release: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Shared attributes: Zip, Birth date, Sex

• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex.
• Name linked to Diagnosis.

Module 1 Tutorial: Differential Privacy in the Wild 24


[S 02]

Linkage Attack
Medical Data Release: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Quasi-Identifier: Zip, Birth date, Sex

• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex.
• 87% of the US population uniquely identified using ZipCode, Birth Date, and Sex.

Module 1 Tutorial: Differential Privacy in the Wild 25


Statistical Database Privacy
Function provided
by the analyst

Server
Output can disclose
sensitive information DB
about individuals

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 26


Statistical Database Privacy
Privacy for individuals
(controlled by a parameter ε)

Server
[privacy-preserving answer to the analyst’s function, computed over DB]
DB

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 27


Statistical Database Privacy
Utility for analyst

Server
[privacy-preserving answer to the analyst’s function, computed over DB]
DB

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 28


Statistical Database Privacy
(untrusted collector)
Server wants to
compute f Server

f ( DB )
Individuals do not want
server to infer their
records

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 29


Statistical Database Privacy
(untrusted collector)
Perturb records to
ensure privacy for Server
individuals and
Utility for server f ( DB* )

Person 1 Person 2 Person 3 Person N


r1 r2 r3 rN

Module 1 Tutorial: Differential Privacy in the Wild 30


Statistical Databases in real-world applications

Application             | Data Collector | Private Information    | Analyst                  | Function (utility)
Medical                 | Hospital       | Disease                | Epidemiologist           | Correlation between disease and geography
Genome analysis         | Hospital       | Genome                 | Statistician/Researcher  | Correlation between genome and disease
Advertising             | Google/FB/Y!   | Clicks/Browsing        | Advertiser               | Number of clicks on an ad by age/region/gender …
Social Recommendations  | Facebook       | Friend links / profile | Another user             | Recommend other users or ads to users based on social network

Module 1 Tutorial: Differential Privacy in the Wild 31


Statistical Databases in real-world applications

• Settings where data collector may not be


trusted (or may not want the liability …)
Application        | Data Collector             | Private Information | Function (utility)
Location Services  | Verizon/AT&T               | Location            | Traffic prediction
Recommendations    | Amazon/Google              | Purchase history    | Recommendation model
Traffic Shaping    | Internet Service Provider  | Browsing history    | Traffic pattern of groups of users

Module 1 Tutorial: Differential Privacy in the Wild 32


Privacy is not …

Module 1 Tutorial: Differential Privacy in the Wild 33


Statistical Database Privacy is not …
• Encryption:

Module 1 Tutorial: Differential Privacy in the Wild 34


Statistical Database Privacy is not …
• Encryption:
Alice sends a message to Bob such that Trudy
(attacker) does not learn the message. Bob should
get the correct message …

• Statistical Database Privacy:


Bob (attacker) can access a database
- Bob must learn aggregate statistics, but
- Bob must not learn new information about
individuals in database.
Module 1 Tutorial: Differential Privacy in the Wild 35
Statistical Database Privacy is not …
• Computation on Encrypted Data:

Module 1 Tutorial: Differential Privacy in the Wild 36


Statistical Database Privacy is not …
• Computation on Encrypted Data:
- Alice stores encrypted data on a server
controlled by Bob (attacker).
- Server returns correct query answers to Alice,
without Bob learning anything about the data.

• Statistical Database Privacy:


- Bob is allowed to learn aggregate properties of
the database.

Module 1 Tutorial: Differential Privacy in the Wild 37


Statistical Database Privacy is not …
• The Millionaires’ Problem:

Module 1 Tutorial: Differential Privacy in the Wild 38


Statistical Database Privacy is not …
• Secure Multiparty Computation:
- A set of agents each having a private input xi …
- … Want to compute a function f(x1, x2, …, xk)
- Each agent can learn the true answer, but must
learn no other information than what can be
inferred from their private input and the answer.

• Statistical Database Privacy:


- Function output must not disclose individual inputs.
Module 1 Tutorial: Differential Privacy in the Wild 39
Statistical Database Privacy is not …
• Access Control:

Module 1 Tutorial: Differential Privacy in the Wild 40


Statistical Database Privacy is not …
• Access Control:
- A set of agents want to access a set of resources (could
be files or records in a database)
- Access control rules specify who is allowed to access (or
not access) certain resources.
- ‘Not access’ usually means no information must be
disclosed

• Statistical Database:
- A single database and a single agent
- Want to release aggregate statistics about a set of
records without allowing access to individual records

Module 1 Tutorial: Differential Privacy in the Wild 41


Privacy Problems
• In today’s cloud context, a number of privacy problems arise:
– Encryption: when communicating data across an insecure channel
– Secure Multiparty Computation: when different parties want to compute a function on their private data without using a centralized third party
– Computing on encrypted data: when one wants to use an untrusted cloud for computation
– Access control: when different users own different parts of the data

• Statistical Database Privacy:


Quantifying (and bounding) the amount of information
disclosed about individual records by the output of a valid
computation.

Module 1 Tutorial: Differential Privacy in the Wild 42


What is privacy?

Module 1 Tutorial: Differential Privacy in the Wild 43


Privacy Breach: Informal Definition
A privacy mechanism M(D)
that allows
an unauthorized party
to learn sensitive information about any individual in D,

which could not have been learnt without access to M(D).

Module 1 Tutorial: Differential Privacy in the Wild 44


Alice

Alice has
Cancer

Is this a privacy breach? NO

Module 1 Tutorial: Differential Privacy in the Wild 45


Privacy Breach: Revised Definition
A privacy mechanism M(D) that allows
an unauthorized party
to learn sensitive information about
any individual Alice in D,

which could not have been learnt without access to M(D)


if Alice was not in the dataset.

Module 1 Tutorial: Differential Privacy in the Wild 46


[S 02]

K-Anonymity: Avoiding Linkage Attacks


• If every row corresponds to one individual …

… every row should be indistinguishable from at least k-1 other rows


based on the quasi-identifier attributes

Module 1 Tutorial: Differential Privacy in the Wild 47


K-Anonymity
Original table (Zip, Age, Nationality, Disease)          4-anonymized table (Zip, Age, Nationality, Disease)

13053 28 Russian Heart 130** <30 * Heart

13068 29 American Heart 130** <30 * Heart

13068 21 Japanese Flu 130** <30 * Flu

13053 23 American Flu 130** <30 * Flu

14853 50 Indian Cancer 1485* >40 * Cancer

14853 55 Russian Heart 1485* >40 * Heart

14850 47 American Flu 1485* >40 * Flu

14850 59 American Flu 1485* >40 * Flu

13053 31 American Cancer 130** 30-40 * Cancer

13053 37 Indian Cancer 130** 30-40 * Cancer

13068 36 Japanese Cancer 130** 30-40 * Cancer

13068 32 American Cancer 130** 30-40 * Cancer

48
[MKGV 06]

Problem: Background knowledge


Adversary knows prior knowledge about Umeko:
Name: Umeko | Zip: 13053 | Age: 25 | Nat.: Japan

4-anonymous release (Zip, Age, Nationality, Disease):
130** | <30   | * | Heart
130** | <30   | * | Heart
130** | <30   | * | Cancer
130** | <30   | * | Cancer
1485* | >40   | * | Cancer
1485* | >40   | * | Heart
1485* | >40   | * | Flu
1485* | >40   | * | Flu
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer
130** | 30-40 | * | Cancer

Adversary learns: Umeko has Cancer

49
A privacy mechanism must be able to
protect individuals’ privacy from
attackers who may possess
background knowledge

Module 1 Tutorial: Differential Privacy in the Wild 50


Healthcare Cost and Utilization Project
#Hospital discharges in NJ of ovarian cancer patients, 2009
(Counts less than k are suppressed, achieving k-anonymity)

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | *               | 19    | 22
1-17  | *           | *     | *     | *        | *                      | *               | *     | *
18-44 | 70          | 40    | 13    | *        | *                      | *               | *     | *
45-64 | 330         | 236   | 31    | 32       | *                      | *               | 11    | *
65-84 | 298         | 229   | 35    | 13       | *                      | *               | *     | *
85+   | 34          | 29    | *     | *        | *                      | *               | *     | *
#Hospital discharges in NJ of ovarian cancer patients, 2009

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | 1               | 19    | 22
1-17  | 3           | 1     | *     | *        | *                      | *               | *     | *
18-44 | 70          | 40    | 13    | *        | *                      | *               | *     | *
45-64 | 330         | 236   | 31    | 32       | *                      | *               | 11    | *
65-84 | 298         | 229   | 35    | 13       | *                      | *               | *     | *
85+   | 34          | 29    | *     | *        | *                      | *               | *     | *

(The suppressed White count for ages 1-17 can be recovered: 1 = 535 − (40+236+229+29).)
#Hospital discharges in NJ of ovarian cancer patients, 2009

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | 1               | 19    | 22
1-17  | 3           | 1     | [0-2] | [0-2]    | [0-2]                  | [0-2]           | [0-2] | [0-2]
18-44 | 70          | 40    | 13    | *        | *                      | *               | *     | *
45-64 | 330         | 236   | 31    | 32       | *                      | *               | 11    | *
65-84 | 298         | 229   | 35    | 13       | *                      | *               | *     | *
85+   | 34          | 29    | *     | *        | *                      | *               | *     | *
#Hospital discharges in NJ of ovarian cancer patients, 2009

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | 1               | 19    | 22
1-17  | 3           | 1     | [0-2] | [0-2]    | [0-2]                  | [0-2]           | [0-2] | [0-2]
18-44 | 70          | 40    | 13    | *        | *                      | *               | *     | *
45-64 | 330         | 236   | 31    | 32       | *                      | *               | 11    | *
65-84 | 298         | 229   | 35    | 13       | *                      | *               | *     | *
85+   | 34          | 29    | [1-3] | *        | *                      | *               | *     | *
[VSJO 13]

Can reconstruct tight bounds on rest of data

Age   | #discharges | White | Black | Hispanic | Asian/Pacific Islander | Native American | Other | Missing
Total | 735         | 535   | 82    | 58       | 18                     | 1               | 19    | 22
1-17  | 3           | 1     | [0-2] | [0-2]    | [0-1]                  | [0]             | [0-1] | [0-1]
18-44 | 70          | 40    | 13    | [9-10]   | [0-6]                  | [0]             | [0-6] | [1-8]
45-64 | 330         | 236   | 31    | 32       | [10]                   | [0]             | 11    | [10]
65-84 | 298         | 229   | 35    | 13       | [2-8]                  | [1]             | [2-8] | [4-10]
85+   | 34          | 29    | [1-3] | [1-4]    | [0-1]                  | [0]             | [0-1] | [0-1]


Multiple Release problem
• Privacy preserving access to data must
necessarily release some information about
individual records (to ensure utility)

• However, k-anonymous algorithms can reveal


individual level information even with two
releases.

Module 1 Tutorial: Differential Privacy in the Wild 57


A privacy mechanism must satisfy
composition …
… or allow a graceful degradation of privacy with
multiple invocations on the same data.

[DN03, GKS08]

Module 1 Tutorial: Differential Privacy in the Wild 58


Postprocessing the output of a privacy
mechanism must not change the privacy
guarantee

[KL10, MK15]

Module 1 Tutorial: Differential Privacy in the Wild 59


Privacy must not be achieved through
obscurity.
Attacker must be assumed to know the algorithm
used as well as all parameters

Module 1 Tutorial: Differential Privacy in the Wild 60


Summary
• Statistical database privacy is the problem of
releasing aggregates while not disclosing individual
records

• The problem is distinct from encryption, secure


computation and access control.

• Defining privacy is non-trivial


– Desiderata include resilience to background knowledge
and composition and closure under postprocessing.

Module 1 Tutorial: Differential Privacy in the Wild 61


MODULE 2:
DIFFERENTIAL PRIVACY

Module 2 Tutorial: Differential Privacy in the Wild 62


Module 2: Differential Privacy
• Differential Privacy Definition

• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response

• Composition Theorems

Tutorial: Differential Privacy in the Wild 63


[D 06]

Differential Privacy
For every pair of inputs D1, D2 that differ in one row, and for every output O:

D1    D2                O

If algorithm A satisfies differential privacy then
Pr[A(D1) = O] / Pr[A(D2) = O]  ≤  exp(ε)       (ε > 0)

Intuition: the adversary should not be able to use the output O to distinguish between any D1 and D2

Module 2 Tutorial: Differential Privacy in the Wild 64
Why pairs of datasets that differ in one row?
For every pair of inputs that
For every output …
differ in one row

D1 D2 O

Simulate the presence or absence of a


single record

Module 2 Tutorial: Differential Privacy in the Wild 65


Why all pairs of datasets …?
For every pair of inputs that
For every output …
differ in one row

D1 D2 O

Guarantee holds no matter what the


other records are.

Module 2 Tutorial: Differential Privacy in the Wild 66


Why all outputs?
P [ A(D1) = O1 ]
A(D1) = O1

D1

.
.
.

D2
P [ A(D2) = Ok ]

Set of all
Module 2 outputs
Tutorial: Differential Privacy in the Wild 67
Should not be able to distinguish whether input
was D1 or D2 no matter what the output

Worst discrepancy
in probabilities

D1

.
.
.

D2
Module 2 Tutorial: Differential Privacy in the Wild 68
Privacy Parameter ε
For every pair of inputs that
For every output …
differ in one row

D1 D2 O

Pr[A(D1) = O] ≤ eε Pr[A(D2) = O]

Controls the degree to which D1 and D2 can be distinguished.


Smaller ε gives more privacy (and worse utility)
Module 2 Tutorial: Differential Privacy in the Wild 69
Outline of the Module 2
• Differential Privacy

• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response

• Composition Theorems

Module 2 Tutorial: Differential Privacy in the Wild 70


Can deterministic algorithms satisfy differential privacy?

Module 2 Tutorial: Differential Privacy in the Wild 71


Non-trivial deterministic algorithms do not
satisfy differential privacy
Space of all inputs Space of all outputs

Module 2 Tutorial: Differential Privacy in the Wild 72


Non-trivial deterministic algorithms do not
satisfy differential privacy
Non-trivial: at least 2 outputs in image

Module 2 Tutorial: Differential Privacy in the Wild 73


There exist two inputs that differ in one entry
mapped to different outputs.

Pr = 1

Pr = 0

Module 2 Tutorial: Differential Privacy in the Wild 74


Random Sampling …
… also does not satisfy differential privacy
Input Output

D1 D2 O

Pr[D2 → O] = 0   implies   Pr[D1 → O] / Pr[D2 → O] = ∞

Module 2 Tutorial: Differential Privacy in the Wild 75


Output Randomization
Query

Database Add noise to


true answer
Researcher
• Add noise to answers such that:
– Each answer does not leak too much information about
the database.
– Noisy answers are close to the original answers.

Module 2 Tutorial: Differential Privacy in the Wild 76


[DMNS 06]

Laplace Mechanism
Query q

Database True answer


q(D) + η
q(D)

Researcher
Privacy depends on the λ parameter

h(η) ∝ exp(−|η| / λ)        Laplace Distribution – Lap(λ)

Mean: 0
Variance: 2λ²

[Figure: density of Lap(λ)]

Module 2 Tutorial: Differential Privacy in the Wild 77
How much noise for privacy?
[Dwork et al., TCC 2006]

Sensitivity: Consider a query q: I → R. S(q) is the smallest number such that for any neighboring tables D, D’,
| q(D) – q(D’) | ≤ S(q)

Thm: If the sensitivity of the query is S, then the Laplace mechanism with
λ = S/ε
guarantees ε-differential privacy. (See the sketch below.)
Module 2 Tutorial: Differential Privacy in the Wild 78
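A minimal sketch of the Laplace mechanism in Python (illustrative only: the function name and the toy count query below are ours, not from the tutorial):

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        # Calibrate the noise scale to lambda = S(q)/epsilon, as in the theorem above.
        rng = np.random.default_rng()
        return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Example: a COUNT query (sensitivity 1) over a toy disease column.
    disease = ["Y", "Y", "Y", "N", "N"]
    true_count = sum(1 for d in disease if d == "Y")          # = 3
    noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1)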
Sensitivity: COUNT query
D: Disease column = [Y, Y, Y, N, N]
• Number of people having disease (Disease = Y)
• Sensitivity = 1

• Solution: 3 + η,
where η is drawn from Lap(1/ε)
– Mean = 0
– Variance = 2/ε²

Module 2 Tutorial: Differential Privacy in the Wild 79


Sensitivity: SUM query
• Suppose all values x are in [a, b]

• Sensitivity = max(|a|, |b|)   (= b when 0 ≤ a ≤ b)

Module 2 Tutorial: Differential Privacy in the Wild 80


Privacy of Laplace Mechanism
• Consider neighboring databases D and D’, and some output O
• Ratio of output probabilities:
  Pr[ q(D) + η = O ] / Pr[ q(D’) + η = O ]
  = exp( ( |O − q(D’)| − |O − q(D)| ) / λ )
  ≤ exp( |q(D) − q(D’)| / λ )          (triangle inequality)
  ≤ exp( S(q) / λ ) = exp(ε)           (since λ = S(q)/ε)

Module 2 Tutorial: Differential Privacy in the Wild 81


Utility of Laplace Mechanism
• Laplace mechanism works for any function that
returns a real number

• Error: E[(true answer – noisy answer)²]

= Var( Lap(S(q)/ε) )

= 2·S(q)² / ε²

Module 2 Tutorial: Differential Privacy in the Wild 82


Exponential Mechanism
• For functions that do not return a real number …
– “what is the most common nationality in this room”:
Chinese/Indian/American…

• When perturbation leads to invalid outputs …


– To ensure integrality/non-negativity of output

Module 2 Tutorial: Differential Privacy in the Wild 83


[MT 07]
Exponential Mechanism
Consider some function f (can be deterministic or probabilistic):
Inputs Outputs

How to construct a differentially private version of f ?

Module 2 Tutorial: Differential Privacy in the Wild 84


Exponential Mechanism
• Scoring function w: Inputs × Outputs → R

• D: nationalities of a set of people


• #(D, O): # people with nationality O
• f(D): most frequent nationality in D
• w(D, O) = |#(D, O) - #(D, f(D))|

Module 2 Tutorial: Differential Privacy in the Wild 85


Exponential Mechanism
• Scoring function w: Inputs × Outputs → R

• Sensitivity of w:
  Δw = max |w(D, O) − w(D’, O)|, taken over all outputs O and all pairs D, D’ that differ in one tuple

Module 2 Tutorial: Differential Privacy in the Wild 86


Exponential Mechanism
Given an input D, and a scoring function w,

Randomly sample an output O from Outputs with


probability proportional to exp( −ε · w(D, O) / (2 · Δw) )
(a lower score w means O is closer to the true answer, so it receives higher probability)

• Note that for every output O, the probability that O is output is > 0.

Module 2 Tutorial: Differential Privacy in the Wild 87
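A small sketch of the exponential mechanism for the nationality example (illustrative; here the score is the count of the candidate output itself, with higher scores better — an equivalent reformulation of the distance-to-maximum score w above — and all names are ours):

    import numpy as np
    from collections import Counter

    def exponential_mechanism(data, outputs, score_fn, sensitivity, epsilon):
        # Sample an output with probability proportional to exp(eps * score / (2 * sensitivity)).
        scores = np.array([score_fn(data, o) for o in outputs], dtype=float)
        scores -= scores.max()                    # shift for numerical stability (cancels out)
        probs = np.exp(epsilon * scores / (2.0 * sensitivity))
        probs /= probs.sum()
        return np.random.choice(outputs, p=probs)

    # Example: "most common nationality in this room", scored by its count (sensitivity 1).
    room = ["Chinese", "Indian", "American", "Chinese", "Indian", "Chinese"]
    domain = ["Chinese", "Indian", "American"]
    count_score = lambda data, o: Counter(data)[o]
    winner = exponential_mechanism(room, domain, count_score, sensitivity=1, epsilon=0.5)

As on the slide, every candidate output is sampled with non-zero probability.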


[W 65]
Randomized Response (a.k.a. local randomization)
With probability p, report the true value; with probability 1−p, report the flipped value.

D: Disease (Y/N)     O: Reported (Y/N)
Y                    Y
Y                    N
N                    N
Y                    N
N                    Y
N                    N
Module 2 Tutorial: Differential Privacy in the Wild 88


Differential Privacy Analysis
• Consider 2 databases D, D’ (of size M) that
differ in the jth value
– D[j] ≠ D’[j]. But, D[i] = D’[i], for all i ≠ j

• Consider some output O

Module 2 Tutorial: Differential Privacy in the Wild 89


Utility Analysis
• Suppose n1 out of N people replied “yes”, and the rest said “no”
• What is the best estimate for π = fraction of people with disease = Y?
• Extract an estimate through post-processing:

π̂ = ( n1/N – (1−p) ) / (2p − 1)

• E(π̂) = π

• Var(π̂) = π(1−π)/N + p(1−p) / ( N(2p−1)² )

              Sampling        Variance due to coin flips
Module 2 Tutorial: Differential Privacy in the Wild 90
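A runnable sketch of randomized response and the post-processing estimator above (illustrative; function names and the simulated population are ours). For p > 1/2 the mechanism satisfies ε-differential privacy with ε = ln(p / (1−p)):

    import numpy as np

    def randomized_response(true_bits, p, rng=None):
        # Each individual reports the truth with probability p, the flipped value otherwise.
        rng = rng or np.random.default_rng()
        bits = np.asarray(true_bits)
        keep = rng.random(len(bits)) < p
        return np.where(keep, bits, 1 - bits)

    def estimate_fraction(reports, p):
        # Unbiased estimator from the slide: pi_hat = (n1/N - (1-p)) / (2p - 1).
        n1, N = np.sum(reports), len(reports)
        return (n1 / N - (1 - p)) / (2 * p - 1)

    # Example: 10,000 people, 30% truly have the disease.
    rng = np.random.default_rng(0)
    true_bits = (rng.random(10_000) < 0.3).astype(int)
    reports = randomized_response(true_bits, p=0.75, rng=rng)
    pi_hat = estimate_fraction(reports, p=0.75)   # close to 0.3 in expectation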


Laplace Mechanism vs Randomized Response

Privacy
• Provide the same ε-differential privacy guarantee

• The Laplace mechanism assumes the data collector is trusted


• Randomized Response does not require the data collector
to be trusted
– Also called a Local Algorithm, since each record is perturbed

Module 2 Tutorial: Differential Privacy in the Wild 91


Laplace Mechanism vs Randomized Response

Utility
• Suppose a database with N records where µN records
have disease = Y.
• Query: # rows with Disease=Y

• Std dev of Laplace mechanism answer: O(1/ε)


• Std dev of Randomized Response answer: O(√N)

Module 2 Tutorial: Differential Privacy in the Wild 92


Outline of the Module 2
• Differential Privacy

• Basic Algorithms
– Laplace & Exponential Mechanism
– Randomized Response

• Composition Theorems

Module 2 Tutorial: Differential Privacy in the Wild 93


Why Composition?
• Reasoning about privacy of
a complex algorithm is hard.

• Helps software design


– If building blocks are proven to be private, it would
be easy to reason about privacy of a complex
algorithm built entirely using these building blocks.

Module 2 Tutorial: Differential Privacy in the Wild 94


A bound on the number of queries
• In order to ensure utility, a statistical database
must leak some information about each
individual
• We can only hope to bound the
amount of disclosure

• Hence, there is a limit on number of


queries that can be answered

Module 2 Tutorial: Differential Privacy in the Wild 95


[DN03]
Dinur Nissim Result
• A vast majority of records in a database of size
n can be reconstructed when n·log²(n) queries
are answered by a statistical database …

… even if each answer has been arbitrarily


altered to have up to o(√n) error

Module 2 Tutorial: Differential Privacy in the Wild 96


Sequential Composition
• If M1, M2, ..., Mk are algorithms that access a private
database D such that each Mi satisfies εi -differential
privacy,

then running all k algorithms sequentially satisfies


ε-differential privacy with ε=ε1+...+εk

Module 2 Tutorial: Differential Privacy in the Wild 97
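A toy illustration of sequential composition as a running privacy-budget accountant (our own sketch; the class and method names are not from any particular DP library):

    class SequentialBudget:
        # Tracks privacy loss under sequential composition: epsilons of successive
        # analyses on the same data simply add up.
        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon):
            if self.spent + epsilon > self.total:
                raise RuntimeError("privacy budget exhausted")
            self.spent += epsilon

    budget = SequentialBudget(total_epsilon=1.0)
    budget.charge(0.25)   # e.g., one noisy count
    budget.charge(0.25)   # a second query on the same data; 0.5 of the budget remains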


Privacy as Constrained Optimization
• Three axes
– Privacy
– Error
– Queries that can be answered

• E.g.: Given a fixed set of queries and privacy


budget ε, what is the minimum error that can
be achieved?

Module 2 Tutorial: Differential Privacy in the Wild 98


Parallel Composition
• If M1, M2, ..., Mk are algorithms that access
disjoint databases D1, D2, …, Dk such that each
Mi satisfies εi -differential privacy,

then running all k algorithms in “parallel”


satisfies ε-differential privacy
with ε= max{ε1,...,εk}

Module 2 Tutorial: Differential Privacy in the Wild 99


Postprocessing
• If M1 is an ε-differentially private algorithm that
accesses a private database D,

then outputting M2(M1(D)) also satisfies ε-


differential privacy.

Module 2 Tutorial: Differential Privacy in the Wild 100


Case Study: K-means Clustering

Module 2 Tutorial: Differential Privacy in the Wild 101


Kmeans
• Partition a set of points x1, x2, …, xn into k
clusters S1, S2, …, Sk such that the following is
minimized:
Minimize:   Σ_{i=1..k}  Σ_{x_j ∈ S_i}  ‖ x_j − μ_i ‖²

where μ_i is the mean of cluster S_i

Module 2 Tutorial: Differential Privacy in the Wild 102


Kmeans
Algorithm:
• Initialize a set of k centers
• Repeat
Assign each point to its nearest center
Recompute the set of centers
Until convergence …

• Output final set of k centers


Module 2 Tutorial: Differential Privacy in the Wild 103
[BDMN 05]
Differentially Private Kmeans
• Suppose we fix the number of iterations to T

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster

3. Compute noisy sums of points in each cluster

Module 2 Tutorial: Differential Privacy in the Wild 104


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
Each iteration uses ε/T privacy budget, total privacy loss is ε

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster

3. Compute noisy sums of points in each cluster

Module 2 Tutorial: Differential Privacy in the Wild 105


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
Exercise: Which of these steps expends privacy budget?

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster

3. Compute noisy sums of points in each cluster

Module 2 Tutorial: Differential Privacy in the Wild 106


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
Exercise: Which of these steps expends privacy budget?

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters NO

2. Noisily compute the size of each cluster YES

3. Compute noisy sums of points in each cluster YES

Module 2 Tutorial: Differential Privacy in the Wild 107


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
What is the sensitivity?

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster — sensitivity: 1

3. Compute noisy sums of points in each cluster — sensitivity: domain size

Module 2 Tutorial: Differential Privacy in the Wild 108


Differentially Private Kmeans
• Suppose we fix the number of iterations to T
Each iteration uses ε/T privacy budget, total privacy loss is ε

• In each iteration (given a set of centers):

1. Assign the points to the new center to form clusters

2. Noisily compute the size of each cluster Laplace(2T/ε)

3. Compute noisy sums of points in each cluster


Laplace(2T |dom|/ε)

Module 2 Tutorial: Differential Privacy in the Wild 109
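A compact sketch of the differentially private k-means loop described above (our own illustrative code, assuming points lie in [−dom, dom]^d; clipping the noisy size at 1 is an implementation choice, not part of the slides):

    import numpy as np

    def private_kmeans(points, k, epsilon, T=10, dom=1.0, rng=None):
        rng = rng or np.random.default_rng()
        points = np.asarray(points, dtype=float)
        n, d = points.shape
        centers = points[rng.choice(n, size=k, replace=False)].copy()   # random initialization
        for _ in range(T):
            # 1. Assign points to nearest center (spends no privacy budget).
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                cluster = points[labels == j]
                # 2. Noisy cluster size: Laplace(2T/eps), count sensitivity 1.
                noisy_size = len(cluster) + rng.laplace(scale=2 * T / epsilon)
                # 3. Noisy per-coordinate sums: Laplace(2T*dom/eps), sum sensitivity dom.
                noisy_sum = cluster.sum(axis=0) + rng.laplace(scale=2 * T * dom / epsilon, size=d)
                centers[j] = noisy_sum / max(noisy_size, 1.0)
        return centers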


Results (T = 10 iterations, random initialization)
Original Kmeans algorithm Laplace Kmeans algorithm

• Even though we noisily compute centers, Laplace kmeans can distinguish


clusters that are far apart.

• Since we add noise to the sums with sensitivity proportional to |dom|,


Laplace k-means can’t distinguish small clusters that are close by.

Module 2 Tutorial: Differential Privacy in the Wild 110


Summary
• Differentially private algorithms ensure an
attacker can’t infer the presence or absence of a
single record in the input based on any output.

• Building blocks
– Laplace, exponential mechanism and randomized
response
• Composition rules help build complex
algorithms using building blocks

Module 2 Tutorial: Differential Privacy in the Wild 111


MODULE 3:
ANSWERING QUERIES ON
TABULAR DATA
Module 3 Tutorial: Differential Privacy in the Wild 112
Module 3: Answering queries on
Tabular data
• Answering query workloads on tabular databases

• Theory: two seminal results

• Survey of algorithm design ideas


– Low dimensional range queries
– Queries on high dimensional data

• Open Questions
Module 3 Tutorial: Differential Privacy in the Wild 113
Problem Formulation
• Input:
– Private database D consisting of a single table
(each tuple represents data of single individual)
– Workload W of counting queries* with arbitrary predicates

SELECT COUNT(*) FROM D WHERE <P>;

• Output: (noisy) answers to W


• Requirement: query answering algorithm satisfies
differential privacy

* Many techniques can also support linear queries:
SELECT SUM(f(t)) FROM D, where a user-defined f maps each tuple to [0,1]

Module 3
Tutorial: Differential Privacy in the Wild 114
Analysis of temporal & spatial patterns
Counting query:
SELECT COUNT(*)
FROM Tweets
WHERE
moodScale=k
AND t <= time
AND time < t+1
AND UScounty = C

Module 3 http://www.ccs.neu.edu/home/amislove/twittermood/
Tutorial: Differential Privacy in the Wild 115
Statistical agencies: data publishing
• U.S. Census Bureau publishes
statistics that can typically be
derived from marginals
• A marginal over attributes
𝐴1, … , 𝐴𝑘 reports count for
each combination of attribute
values.
– aka cube, contingency table
– E.g. 2-way marginal on
EmploymentStatus and Gender
• Thousands of marginals
released

https://factfinder.census.gov/
Module 3 Tutorial: Differential Privacy in the Wild 116
[WWLTRD09]

Genome Wide Association Studies


• Goal: to study genetic factors associated with a given disease
• Collect subsets of the population with diseases (case) and without
(control)
• Extract SNPs (specific DNA subsequences)
– For each SNP, usually find 2 alleles (alternative forms of the gene)

• Counting queries: Compute allele


frequencies in both the case and
control groups
(marginal over SNPxDisease)

• Perform association test using these frequencies (e.g., Chi Square Test)
to identify SNPs highly associated with disease

Module 3 Tutorial: Differential Privacy in the Wild 117


Problem variant: offline vs. online
• Offline (batch):
– Entire W given as input, answers computed in batch
• Online (adaptive):
– W is sequence q1, q2, … that arrives online
– Adaptive: analyst’s choice for qi can depend on
answers a1, …, a(i−1)
• Answering linear queries online is strictly harder
than answering them offline [BSU16].

Module 3 Tutorial: Differential Privacy in the Wild 118


Important aspects of problem:
Data and query complexity
• Data complexity
– Dimensionality: number of attributes
– Domain size: number of distinct attribute combinations
– Many techniques specialized for low dimensional data
• Query complexity
– Many techniques designed to work well for a specific
class of queries
– Classes (in rough order of difficulty): histograms, range
queries, marginals, counting queries, linear queries

Module 3 Tutorial: Differential Privacy in the Wild 119


Solution variants: query answers vs.
synthetic data
Two high-level approaches to solving problem
1. Direct:
– Output of the algorithm is list of query answers
2. Synthetic data:
– Algorithm constructs a synthetic dataset D’, which can
be queried directly by analyst
– Analyst can pose additional queries on D’
(though answers may not be accurate)

Module 3 Tutorial: Differential Privacy in the Wild 120


Theory
• Given negative result of Dinur-Nissim, is there any
hope?
• Yes!
– The key to Dinur-Nissim is that query answers have
independent noise (which can cancel out)
– To answer more queries, query error must be correlated
• Examples of correlation
– Use some query answers to approximate others
– Construct a synthetic database that is approximately
accurate for queries of interest

Module 3 Tutorial: Differential Privacy in the Wild 121


[BLR08]

Answering Exponentially Many Queries


Offline
• Key technical insight: For any set of count
queries 𝑊, there exists a small database 𝐷’
consistent with 𝐷 on every query in 𝑊.
– Small: O( log(|W|) / α² )
– Consistent: error for any q in W is at most ⍺
• Result follows from learning theory:
– Estimates on small random sample will generalize to
population

Module 3 Tutorial: Differential Privacy in the Wild 122


[BLR08]

Answering Exponentially Many Queries


Offline
• Input: 𝑊, 𝜀
• Output: 𝐷′
• The Mechanism:
– T = {all small databases D’}
– f(D, D’) = −max_{q ∈ W} |q(D) – q(D’)|
– Output D’ ∈ T using the Exponential
Mechanism applied to f
• Theorem: The mechanism is ε-private and w.h.p. the error ⍺ is at most
O( ( log|domain| · log|W| / (ε·|D|) )^(1/3) )
Module 3 Tutorial: Differential Privacy in the Wild 123
[BLR08]

Answering Exponentially Many Queries


Offline
• Input: 𝑊, 𝜀
• Output: 𝐷′
• The Mechanism:
– T = {all small databases D’}
– f(D, D’) = −max_{q ∈ W} |q(D) – q(D’)|
– Output D’ ∈ T using the Exponential Mechanism applied to f

• Theorem: The mechanism is ε-private and w.h.p. the error ⍺ is at most
O( ( log|domain| · log|W| / (ε·|D|) )^(1/3) )

Limitations
• Impractical: runtime exponential
• Offline (for online see [HR10])
Module 3 Tutorial: Differential Privacy in the Wild 124
Case study: range queries over spatial
data
Input: sensitive data D
Input: range query workload W (shown is a workload of 3 range queries)

BeijingTaxi dataset[1]:
4,268,780 records of (lat,lon)
pairs of taxi pickup locations
in Beijing, China in 1 month.

Scatter plot of input data

Task: compute answers to workload W over private input D

[1] Raw data from: Taxi trajectory open dataset, Tsinghua University, China. http://sensor.ee.tsinghua.edu.cn, 2009.

Module 3 Tutorial: Differential Privacy in the Wild 125
Baseline algorithm
1. Discretize attribute domain
into cells
2. Add noise to cell counts
(Laplace mechanism)
3. Use noisy counts to either…
   1. Answer queries directly (assume the distribution is uniform within each cell)
   2. Generate synthetic data (derive a distribution from the counts and sample)

Scatter plot of input data

Limitations
• Granularity of discretization
  – Coarse: detail lost
  – Fine: noise overwhelms signal
• Noise accumulates: squared error grows linearly with range
Module 3 Tutorial: Differential Privacy in the Wild 126
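A 1D sketch of this baseline (our own illustrative code; the fixed 128-cell domain mirrors the error-analysis example on the next slide):

    import numpy as np

    def noisy_histogram(values, bins, lo, hi, epsilon, rng=None):
        # Steps 1-2: discretize with data-independent bin edges, then add Laplace(1/eps)
        # noise to every cell (a histogram has sensitivity 1).
        rng = rng or np.random.default_rng()
        counts, edges = np.histogram(values, bins=bins, range=(lo, hi))
        return counts + rng.laplace(scale=1.0 / epsilon, size=len(counts)), edges

    def range_query(noisy_counts, i, j):
        # Step 3: answer a range query by summing the noisy cells it covers;
        # the variance (and hence squared error) grows linearly with the range length.
        return noisy_counts[i:j + 1].sum()

    data = np.random.default_rng(0).uniform(0, 128, size=5000)
    noisy, edges = noisy_histogram(data, bins=128, lo=0, hi=128, epsilon=0.1)
    answer = range_query(noisy, 0, 63)    # noisy count of values in the first 64 cells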
Error analysis
[Figure: per-query error of the baseline algorithm on every range query over a 128-value domain — low for small ranges such as [x1, x1], high for large ranges such as [x1, x128]]

• Dataset is 1D: single attribute, 128 possible values
• Baseline simply adds noise to every cell count
• Incurs high error on large ranges
Module 3 Tutorial: Differential Privacy in the Wild 127
[CPSSY12]

Improved algorithm: private quad-tree


• Input: maximum height 𝒉, minimum
leaf size 𝑳, data set D
• Recurse on nodes of tree:
– Add 𝐿𝑎𝑝(1/𝜀 ) noise to node count
– Split node domain into quadrants
– Create child nodes
• Stop when:
– Noisy count of node ≤ 𝑳
– Max height 𝒉 is reached
quadtree
• Intuition:
– Early stopping controls granularity of
discretization
– To answer long range queries, leverage
hierarchy of noisy counts

Module 3 Tutorial: Differential Privacy in the Wild 128
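A recursive sketch of the private quad-tree just described (our own illustrative code; the bounding-box handling and the dictionary node format are assumptions, not from [CPSSY12]). Nodes at the same level touch disjoint data, so each level spends ε and a tree of height h spends at most εh in total:

    import numpy as np

    def private_quadtree(points, box, epsilon, h, L, rng=None):
        rng = rng or np.random.default_rng()
        noisy_count = len(points) + rng.laplace(scale=1.0 / epsilon)   # Lap(1/eps) per node
        node = {"box": box, "count": noisy_count, "children": []}
        if h == 0 or noisy_count <= L:        # stopping rules: max height / small noisy count
            return node
        xmin, ymin, xmax, ymax = box
        xm, ym = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        quadrants = [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
                     (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]
        for q in quadrants:
            mask = ((points[:, 0] >= q[0]) & (points[:, 0] < q[2]) &
                    (points[:, 1] >= q[1]) & (points[:, 1] < q[3]))
            node["children"].append(private_quadtree(points[mask], q, epsilon, h - 1, L, rng))
        return node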


[CPSSY12]

Improved algorithm: private quad-tree


• Input: maximum height h, minimum leaf size L, data set D
• Recurse on nodes of tree:
– Add Lap(1/ε) noise to node count
– Split node domain into quadrants
– Create child nodes
• Stop when:
– Noisy count of node ≤ L
– Max height h is reached
• Intuition:
– Early stopping controls granularity of discretization
– To answer long range queries, leverage hierarchy of noisy counts

Exercise: Let h′ ≤ h be the height of the resulting tree. The algorithm satisfies ε’-differential privacy for ε’ equal to:
1. εh′
2. εh
3. 4εh
4. ε4h
Module 3 Tutorial: Differential Privacy in the Wild 129
Using tree structure for range queries
• Quad-tree, each node has noisy count. How use
to answer range queries?
• Idea 1: given range query q, find smallest set of
noisy counts from tree that “cover” q
• Idea 2 (better):
– Observation: node’s noisy count and sum of
children’s noisy counts are two estimates of same
quantity
– Combine estimates using statistical inference

Module 3 Tutorial: Differential Privacy in the Wild 130
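A minimal sketch of “Idea 2” for a single node (assuming both estimates are unbiased with known variances; the full inference of [HRMS10] additionally propagates the combined estimates back down the tree so that all answers are consistent):

    def combine_estimates(parent_noisy, children_noisy, var_parent, var_child):
        # The node's own noisy count and the sum of its children's noisy counts are two
        # unbiased estimates of the same quantity; combine them by inverse-variance weighting.
        children_sum = sum(children_noisy)
        var_children_sum = var_child * len(children_noisy)
        w = (1.0 / var_parent) / (1.0 / var_parent + 1.0 / var_children_sum)
        return w * parent_noisy + (1.0 - w) * children_sum

    # Example: a quad-tree node and its 4 children, all noised with Lap(b) (variance 2*b^2).
    b = 10.0
    estimate = combine_estimates(103.0, [20.0, 31.0, 28.0, 18.0],
                                 var_parent=2 * b * b, var_child=2 * b * b)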


Error comparison
[Figure: error of the tree algorithm on every range query vs. error of the baseline algorithm on every range query]
Module 3 Tutorial: Differential Privacy in the Wild 131
Error comparison
Hierarchy lowers error on large ranges but incurs slightly higher error for small ranges

[Figure: error of the tree algorithm on every range query vs. error of the baseline algorithm on every range query]
Module 3 Tutorial: Differential Privacy in the Wild 132
Credit: Cormode

Data transformations
• Can think of trees as transform of input

Original Data → Transform → Coefficients → Add Noise → Noisy Coefficients → “Invert” → Private Data

• Can apply other data transformations


• Goal: pick a low sensitivity transform that preserves
good properties of data
Module 3 Tutorial: Differential Privacy in the Wild 133
Linear transformations
• Examples
– Hierarchical: Trees [HRMS10,QYL13], full height quadtree
[CPSSY12]
– Haar Wavelet [XWG10]
– Discrete Fourier transform [BCDKMT07]
• Inverting transformation
– Some transformations (e.g. tree) have redundancy (over-
constrained), so require pseudo-inverse
• Matrix Mechanism [LHRMM10,LM12,LM13]
– Formalizes problem of designing a linear transform that is tailored
to the queries
• Error rates are independent of input (assumes linear transform
is “full rank”)

Module 3 Tutorial: Differential Privacy in the Wild 134


Lossy transformations
• Variants
– Drop “small” coefficients:
• Quad-tree with early stopping (noisy count threshold)
• Fourier coefficients: EFPA [ACC12], [RN10]
– Data-adaptive discretization:
• PrivTree [ZXX16], KD-Tree [CPSSY12], DAWA [LHMY14], [DNRR15], [QYL13], [BLR08]
– Data-adaptive measurement:
• MWEM [HLM12], DualQuery [GAHRW14]
– Randomized transforms: sketches and compressed sensing
• JL Transform [BBDS12], Compressive mechanism [LZWY11]
• “Inverting” transformation
– Because lossy, they are under-constrained, requires estimation
• Error rates depend on input
– Can be much lower (trades off small bias for lower variance)
– Warrants careful empirical evaluation; algorithms are “data dependent”

Module 3 Tutorial: Differential Privacy in the Wild 135


High dimensional data
• Generally an under-studied area
• Two algorithms, both synthetic data generators
– PrivBayes [ZCPSX14]
– DualQuery [GAHRW14]
• Common properties
– Limited to binary attributes
– Designed to support low-order marginals
(and other workloads well approximated by
marginals, such as classification)

Module 3 Tutorial: Differential Privacy in the Wild 136


[ZCPSX14]

PrivBayes

[Diagram: a high-dimensional table T (attributes ABCDEFG……) is decomposed into low-dimensional tables (e.g., ABC, CD, BE, DEF, …); noise is added to each; the noisy low-dimensional tables are used to reconstruct a noisy table T*]

• Method:
– Decompose the high-dimensional table into low-dimensional tables and add noise
– Use a Bayesian network to learn the data distribution
– After the BN is learned, generate synthetic data by sampling from the BN
• Challenge: privately choosing a good decomposition
Module 3 Tutorial: Differential Privacy in the Wild 137
[GAHRW14]

Dual Query
• Problem of generating synthetic data formulated as a zero sum game
between
– Data player: generates synthetic records to reduce query error
– Query player: chooses queries with high error (on current synthetic dataset)
– Theoretical analysis of utility comes from studying equilibrium of game

1. Let Qt be a distribution over queries (Q1 is uniform)
2. For t = 1 … T
   a) S ← Query player samples s queries from Qt    (Privacy: this step accesses the private data)
   b) Data player finds record xt that maximizes total answer on S    (NP-hard optimization)
   c) Qt+1 ← Query player updates(xt, D, Qt)
3. Output synthetic database x1, …, xt

• What makes it practical?
– Unlike some prior work [HLM12], avoids storing a distribution over the domain (exponential in size)
– An approximate solution to the NP-hard step may be good enough!
– The optimization problem can be solved with off-the-shelf solvers
• Case study on 500K 3-way marginals over 17K binary attributes, using the CPLEX solver

Module 3 Tutorial: Differential Privacy in the Wild 138


[HMMCZ16]

Empirical benchmarks
• [HMMCZ16] propose a novel evaluation framework for
standardized evaluation of privacy algorithms.
• Study of algorithms for range query answering over 1 and 2D
• Benchmark website www.dpcomp.org

Some data-
dependent
algorithms fail to
offer benefits at
larger scales (no.
of tuples)

Module 3 Tutorial: Differential Privacy in the Wild 139


Open questions
• Robust and private algorithm selection
– Pythia (Thursday 2 PM DP Session)
• Error bounds for data-dependent algorithms
• Empirical evaluation of algorithms for high
dimensional data

Module 3 Tutorial: Differential Privacy in the Wild 140


Differential Privacy References
CACM Articles
• [D11] Dwork, A firm foundation for private data analysis. In CACM, 2011.
• [MK15] Machanavajjhala & Kifer, Designing statistical privacy for your data. In CACM, 2015.
Book
• [DR14] Dwork & Roth, The Algorithmic Foundations of Differential Privacy. In Foundations
and Trends, 2014.
Tutorials
• [C13] Graham Cormode. Building blocks of privacy: Differentially private mechanisms. In
PrivDB, 2013.
• [YZMWX12] Yang et al., Differential privacy in data publication and analysis. In SIGMOD, 2012.
• [HLMPT11] Hay et al., Privacy-aware Data Management in Information Networks. In SIGMOD,
2011.
• [CS14]K. Chaudhuri and A. D. Sarwate. Differential privacy for signal processing and machine
learning. In WIFS, 2014.
• [KNRS13] S. P. Kasiviswanathan, K. Nissim, S. Raskhodnikova, and A. Smith. Analyzing graphs
with node differential privacy. In TCC, 2013.

Tutorial: Differential Privacy in the Wild 141


Module 1 References
• [S02] Sweeney, “K-anonymity: a model for protecting privacy”, IJUFKS 2002
• [DN03] Dinur, Nissim, “Revealing information while preserving privacy”,
PODS 2003
• [D06] Dwork, “Differential Privacy”, ICALP 2006
• [MKGV06] Machanavajjhala, Kifer, Gehrke, Venkitasubramaniam, “L-
Diversity” ICDE 2006
• [GKS08] Ganta, Kasiviswanathan, Smith, “Composition attacks and auxiliary
information in data privacy”, KDD 2008
• [KL10] Kifer, Lin, “Towards an Axiomatization of Statistical Privacy and
Utility.”, PODS 2010
• [VSJO13] Vaidya, Shafiq, Jiang, Ohno-Machado, “Identifying inference
attacks against healthcare data repositories”, AMIA 2013
• [MK15] Machanavajjhala, Kifer, “Designing statistical privacy for your data”,
CACM 2015
Module 1 Tutorial: Differential Privacy in the Wild 142
Module 2 References
• [W65] Warner, “Randomized Response” JASA 1965
• [DN03] Dinur, Nissim, “Revealing information while preserving privacy”,
PODS 2003
• [BDMN05] Blum, Dwork, McSherry, Nissim, “Practical privacy: the SuLQ
framework”, PODS 2005
• [D06] Dwork, “Differential Privacy”, ICALP 2006
• [DMNS06] Dwork, McSherry, Nissim, Smith, “Calibrating noise to sensitivity
in private data analysis”, TCC 2006
• [MT07] McSherry, Talwar, “Mechanism Design via Differential Privacy”,
FOCS 2007

Module 2 Tutorial: Differential Privacy in the Wild 143


Module 3 References
• [ACC12] Ács et al. Differentially private histogram publishing through lossy compression. In ICDM, 2012.
• [BBDS12] Blocki et al. The johnson-lindenstrauss transform itself preserves differential privacy. In FOCS, 2012.
• [BCDKMT07] Barak et al. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, 2007.
• [BLR08] Blum et al. A learning theory approach to noninteractive database privacy. In STOC, 2008.
• [DNRR15] Dwork et al. Pure Differential Privacy for Rectangle Queries via Private Partitions. In ASIACRYPT, 2015.
• [CPSSY12] Cormode et al. Differentially Private Spatial Decompositions. In ICDE, 2012.
• [GAHRW14] Gaboardi et al. Dual Query: Practical Private Query Release for High Dimensional Data. In ICML, 2014.
• [HLM12] Hardt et al. A simple and practical algorithm for differentially private data release. In NIPS, 2012.
• [HMMCZ16] Hay et al. Principled Evaluation of Differentially Private Algorithms using DPBench. In SIGMOD, 2016.
• [HRMS10] Hay et al. Boosting the accuracy of differentially private histograms through consistency. In PVLDB, 2010.
• [LHMY14] Li et al. A data- and workload-aware algorithm for range queries under differential privacy. In PVLDB, 2014.
• [LHRMM10] Li et al. Optimizing linear counting queries under differential privacy. In PODS, 2010.
• [LM12] Li et al. An adaptive mechanism for accurate query answering under differential privacy. In PVLDB, 2012.
• [LM13] Li et al. Optimal error of query sets under the differentially-private matrix mechanism. In ICDT, 2013.
• [LZWY11] Li et al. Compressive mechanism: utilizing sparse representation in differential privacy. In WPES, 2011.
• [QYL13] Qardaji et al. Understanding hierarchical methods for differentially private histograms. In PVLDB, 2013.
• [QYL13] Qardaji et al. Differentially private grids for geospatial data. In ICDE, 2013.
• [RN10] Rastogi et al. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD, 2010.
• [WWLTRD09] Wang et al. Privacy-preserving genomic computation through program specialization. In CCS, 2009.
• [XWG10] Xiao et al. Differential privacy via wavelet transforms. In ICDE, 2011.
• [ZCPSX14] Zhang et al. PrivBayes: private data release via bayesian networks. In SIGMOD, 2014.
• [ZXX16] Zhang et al. PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions. In SIGMOD, 2016.

Module 3 Tutorial: Differential Privacy in the Wild 144


Differential Privacy in the Wild
(Part 2)
A Tutorial on Current Practices and Open Challenges
Outline of the Tutorial
1. What is Privacy?
2. Differential Privacy
3. Answering Queries on Tabular Data
Break
4. Applications I: Machine Learning
5. Privacy in the Real World
6. Applications II: Networks and Trajectories

Tutorial: Differential Privacy in the Wild 2


MODULE 4:
APPLICATIONS I: MACHINE
LEARNING
Module 4 Tutorial: Differential Privacy in the Wild 3
Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Module 4 Tutorial: Differential Privacy in the Wild 4
Differentially Private Machine Learning

Predicts flu or not, based on patient symptoms


Trained on sensitive patient data Credit: Chaudhuri

Module 4 Tutorial: Differential Privacy in the Wild 5


From Attributes to Labeled Data

Sore Throat: Yes    Fever: No    Temperature: 99F    Flu?: No

Data: x = (1, 0, 99)        Label: y = −

Module 4 Tutorial: Differential Privacy in the Wild 6


Classifying Sensitive Data

[Figure: labeled points (+ / −) in feature space]

Private Data  →  Classification Algorithm  →  Public Classifier

Module 4 Tutorial: Differential Privacy in the Wild 7


Classifying Sensitive Data

[Figure: labeled examples (+ / −) drawn from a distribution P over labelled examples]

Goal: Find a vector 𝑤 that separates + from - for


most points from 𝑃

Key: Find a simple model to fit the samples

Module 4 Tutorial: Differential Privacy in the Wild 8


Empirical Risk Minimization
• Training dataset:
– Labeled data D = { (x_i, y_i) ∈ X × Y : i = 1, 2, …, n }
– e.g. binary classification: X = R^d, Y = {−1, +1}
– Train a predictor over D:  ω : X → Y

• Empirical risk (or error) of ω over D is

  (1/n) · Σ_{i=1..n} l(ω, (x_i, y_i))

– l is a loss function: how well ω classifies (x_i, y_i)

Module 4 Tutorial: Differential Privacy in the Wild 9


Examples of Loss Function
[Figure: labeled points (+ / −) in feature space]

Risk: Hinge loss l(z) = max(0, 1 − z)
Optimizer: Support vector machines (SVM)

Risk: Logistic loss l(z) = log(1 + exp(−z))


Optimizer: Logistic regression

Module 4 Tutorial: Differential Privacy in the Wild 10


Regularized ERM
• Goal: Labeled data 𝐷 = {(𝑥𝑖 , 𝑦𝑖 )}, find

f(D) = argmin_ω  (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i))

        Regularizer                  Risk
     (Model Complexity)        (Training Error)

Module 4 Tutorial: Differential Privacy in the Wild 11


Why is ERM not private for
Support Vector Machines (SVM)?

[Figure: SVM decision boundary determined by a few support vectors]

SVM solution is a combination of support vectors


If one support vector moves, solution changes

Module 4 Tutorial: Differential Privacy in the Wild 12


Why is ERM not private for
Support Vector Machines (SVM)?

[Figure: moving a single support vector changes the SVM decision boundary]

SVM solution is a combination of support vectors


If one support vector moves, solution changes

Module 4 Tutorial: Differential Privacy in the Wild 13


Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Module 4 Tutorial: Differential Privacy in the Wild 14
How to make ERM private?

[Figure: labeled points (+ / −)]    Pick ω from a distribution concentrated near the optimal solution

Module 4 Tutorial: Differential Privacy in the Wild 15


Output Perturbation
• Goal:
f̂(D) = f(D) + noise
     = [ argmin_ω  (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i)) ] + noise

Theorem [CMS11]: If ‖x_i‖ ≤ 1 and l is 1-Lipschitz,


then for any D, D′ with dist(D, D′) = 1,
‖ f(D) − f(D′) ‖₂  ≤  2/(λn)        (L₂-sensitivity)

Module 4 Tutorial: Differential Privacy in the Wild 16


Output Perturbation
• Goal:
𝑓V 𝐷 = 𝑓 𝐷 + 𝑛𝑜𝑖𝑠𝑒 =
@
1 U
1
Z𝑎𝑟𝑔𝑚𝑖𝑛R 𝜆 ∥ 𝜔 ∥ + ; 𝑙(𝜔, 𝑥> , 𝑦> )[ + 𝑛𝑜𝑖𝑠𝑒
2 𝑛
>AB

• Laplace 𝑛𝑜𝑖𝑠𝑒 drawn from


U
– Magnitude: drawn from Γ(d, )
d@g
– Direction: uniform at random

Module 4 Tutorial: Differential Privacy in the Wild 17
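A sketch of output perturbation for L2-regularized logistic regression along these lines (our own illustrative code using scikit-learn; it assumes ‖x_i‖ ≤ 1 and labels in {−1, +1}, and is not the reference implementation of [CMS11]):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def output_perturbation_lr(X, y, lam, epsilon, rng=None):
        rng = rng or np.random.default_rng()
        n, d = X.shape
        # Non-private regularized ERM; sklearn's C corresponds to 1/(lam * n).
        clf = LogisticRegression(C=1.0 / (lam * n), fit_intercept=False).fit(X, y)
        w = clf.coef_.ravel()
        # Noise with magnitude ~ Gamma(d, 2/(lam*n*eps)) and a uniformly random direction,
        # matching the calibration on the slide.
        magnitude = rng.gamma(shape=d, scale=2.0 / (lam * n * epsilon))
        direction = rng.normal(size=d)
        direction /= np.linalg.norm(direction)
        return w + magnitude * direction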


Property of Real Data

[Figure: loss surface around the ERM solution — steep in some directions]

The optimization surface is very steep in some directions
→ High loss if perturbed in those directions
Module 4 Tutorial: Differential Privacy in the Wild 18
Objective Perturbation
• Insight: Perturb optimization surface and then
optimize
f̂(D) = argmin_ω [ (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i)) + noise ]

• Main idea: add noise as part of the computation:


– Regularization already changes the objective to protects
against overfitting.
– Change the objective a little bit more to protect privacy.

Module 4 Tutorial: Differential Privacy in the Wild 19


Objective Perturbation
• Insight: Perturb optimization surface and then
optimize
f̂(D) = argmin_ω [ (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i)) + noise ]

• noise drawn as follows:


– Magnitude: drawn from Γ( d, 1/ε )
– Direction: uniform at random

Module 4 Tutorial: Differential Privacy in the Wild 20


Objective Perturbation
• Insight: Perturb optimization surface and then
optimize
f̂(D) = argmin_ω [ (λ/2) ‖ω‖² + (1/n) Σ_{i=1..n} l(ω, (x_i, y_i)) + noise ]

• Theorem [CMS11]: If l is convex and doubly differentiable


with |l′(z)| ≤ 1 and |l″(z)| ≤ c, then the algorithm
satisfies ( ε + 2·log(1 + c/(nλ)) )-differential privacy.

Module 4 Tutorial: Differential Privacy in the Wild 21


Accuracy
• Number of samples needed for error α w.r.t. the best predictor
– Fewer samples implies higher accuracy

d: #dimensions,  γ: margin,  ε: privacy,  α: error,  with γ, α, ε < 1

• Normal SVM:               1 / (γα²)
• Objective perturbation:   1 / (γα²)  +  d / (γεα)
• Output perturbation:      1 / (γα²)  +  d / (γ^1.5 εα)

Module 4 Tutorial: Differential Privacy in the Wild 22


Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Module 4 Tutorial: Differential Privacy in the Wild 23
Stochastic Gradient Descent (SGD)
• Initialize ω₀
• Incremental gradient update, for t = 0 … T−1:
– Take a random example (x_t, y_t) ∈ D
– Update ω_{t+1} = ω_t − η_t · ∇l(ω_t, (x_t, y_t))
  • η_t is the step size

• Permutation-based SGD (PSGD)


– Randomly permute the training examples D = {(x_i, y_i)} to feed each pass of SGD
– Cycle through D k times: k-pass PSGD

Module 4 Tutorial: Differential Privacy in the Wild 24


White Box Approaches
• Initialize ω₀
• Incremental gradient update, for t = 0 … T−1:
– Take a random example (x_t, y_t) ∈ D
– Update ω_{t+1} = ω_t − η_t · ( ∇l(ω_t, (x_t, y_t)) + noise )
  • η_t is the step size

• Permutation-based SGD (PSGD)


– Randomly permute the training examples D = {(x_i, y_i)} to feed each pass of SGD
– Cycle through D k times: k-pass PSGD

Module 4 Tutorial: Differential Privacy in the Wild 25
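An illustrative sketch of the “white box” structure (permutation-based SGD with per-example gradient clipping and per-update noise). The function and parameter names are ours, and noise_scale is deliberately left as a parameter: choosing it to achieve a target (ε, δ) requires the composition analyses cited on the next slides ([BST14], [ACG16]):

    import numpy as np

    def noisy_psgd(X, y, grad_fn, dim, noise_scale, k=1, eta=0.1, clip=1.0, rng=None):
        rng = rng or np.random.default_rng()
        w = np.zeros(dim)
        for _ in range(k):                                    # k passes = k-pass PSGD
            for i in rng.permutation(len(X)):
                g = grad_fn(w, X[i], y[i])
                g = g / max(1.0, np.linalg.norm(g) / clip)    # clip gradient to L2 norm <= clip
                w = w - eta * (g + rng.normal(scale=noise_scale, size=dim))
        return w

    # Example gradient for the logistic loss l(w,(x,y)) = log(1 + exp(-y * w.x)), y in {-1,+1}:
    logistic_grad = lambda w, x, y: -y * x / (1.0 + np.exp(y * np.dot(w, x)))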


White Box Approaches
• Cycle 𝐷 for 𝑘 times
• Basic composition:
– Each pass is 𝜖-DP, then 𝑘-pass is 𝜖𝑘-DP.
– Privacy loss grows linearly with the number of
passes. [CSC13, SS15]
• Tighter privacy loss with advanced composition
– Convex objectives [JKT12, BST14]
– Deep learning with non-convex objectives [ACG16]

Module 4 Tutorial: Differential Privacy in the Wild 26


Advanced Composition
[DRV10]

• Composing k algorithms, each satisfying ε-DP,


ensures ε_g-DP with probability 1 − δ, where
ε_g = O( ε·√(k·ln(1/δ)) + k·ε² )

• Analyze the privacy loss as a random variable: given


output o and neighbors D, D′
PL(o) = ln( Pr[A(D) = o] / Pr[A(D′) = o] )

Module 4 Tutorial: Differential Privacy in the Wild 27


Advanced Composition
[DRV10]

• Composing k algorithms, each satisfying ε-DP,


ensures ε_g-DP with probability 1 − δ, where
ε_g = O( ε·√(k·ln(1/δ)) + k·ε² )

• Each algorithm has privacy loss PL(o)


– Worst case (DP): Pr[ |PL(o)| ≤ ε ] = 1
– Expected loss: E[ PL(o) ] ≤ ε(e^ε − 1)
– The total privacy loss ε_g is bounded by Azuma’s inequality

Module 4 Tutorial: Differential Privacy in the Wild 28
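Instantiating the bound above with the standard constants from [DRV10] (a sketch; the function name is ours):

    import math

    def advanced_composition(eps, k, delta):
        # eps_g = eps * sqrt(2k ln(1/delta)) + k * eps * (e^eps - 1)
        return eps * math.sqrt(2 * k * math.log(1 / delta)) + k * eps * (math.exp(eps) - 1)

    print(advanced_composition(eps=0.1, k=100, delta=1e-5))   # ~5.9, vs. 10.0 under basic composition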


Black Box Approaches
• Add noise to the final output of SGD [WLK17]
– No need code change to the SGD program
– Only sample noise once
– Allow 𝜖-DP and (𝜖, 𝛿)-DP
– Better convergence for a constant number of passes, based on a new bound on the L₂ sensitivity of k-pass PSGD

“Bolt-on DP” (Thursday 2 PM DP Session)

  Initialize ω₀
  For t = 0 … T−1:
      ω_{t+1} ← update(ω_t)
  ω_T ← ω_T + noise
  Output ω_T
Module 4 Tutorial: Differential Privacy in the Wild 29
L₂ sensitivity of k-pass PSGD
• If l is β-smooth and L-Lipschitz, the L₂ sensitivity is
– 2kLη if l is convex, with constant step size η_t = η ≤ 2/β
– 2L/(λn) if l is λ-strongly convex, with η_t = min(1/β, 1/(λt)), |D| = n
– Convergence when k = O(1)

Convergence comparison:
                   [WLK17] (black box)     [BST14] (white box)
Convex             O( √d / n )             O( √d · log²n / n )
Strongly convex    O( d·log n / n )        O( d·log²n / n )
Module 4 Tutorial: Differential Privacy in the Wild 30


Other Fitting Techniques
• Mini-batching SGD
– At step 𝑡, the gradient is updated with a batch of
examples 𝐵x from 𝐷
– Add noise per iteration
• ω_{t+1} = ω_t − η_t · ( E_{(x_i, y_i) ∈ B_t}[ ∇l(ω_t, (x_i, y_i)) ] + noise )
– Or add noise to the final output

• Proximal algorithm for strongly convex


optimization [JKT12]
– Add noise per iteration
– Harder to implement than SGD
Module 4 Tutorial: Differential Privacy in the Wild 31
Module 4: Applications I
• Private Empirical Risk Minimization
– E.g. SVM, logistic regression
– Make a specific learning approach private

• Private Stochastic Gradient Descent


– E.g. Deep learning
– Make a general-purpose fitting technique private

• Other Important Problems in Private Learning


Module 4 Tutorial: Differential Privacy in the Wild 32
Other Important Problems
• Practical issues
– Parameter tuning: exponential mechanism [CMS11]
– High dimensional data: random projection [WLK17]

• Solve non-convex optimization


– Deep learning [SS15, ACG16]

• Understand what can be learned privately [KLNR11]


– Private learning w/o efficiency: PAC, SQ
– What cannot be learned privately? e.g. threshold functions
where hypothesis space is infinite

Module 4 Tutorial: Differential Privacy in the Wild 33


DP Algorithms for ML
• Private ERM – a specific learning approach
  – Output perturbation:    argmin(objective) + noise
  – Objective perturbation: argmin(objective + noise)

• Private SGD – a fitting technique
  – White box approaches:
      Initialize 𝜔₀
      For 𝑡 = 0 … 𝑇 − 1
          𝜔_{t+1} ← update(𝜔_t)
          𝜔_{t+1} ← 𝜔_{t+1} + noise
      Output 𝜔_T
  – Black box approaches:
      Initialize 𝜔₀
      For 𝑡 = 0 … 𝑇 − 1
          𝜔_{t+1} ← update(𝜔_t)
      Output 𝜔_T
      𝜔_T ← 𝜔_T + noise
Module 4 Tutorial: Differential Privacy in the Wild 34
MODULE 5:
PRIVACY IN THE REAL WORLD

Module 5 Tutorial: Differential Privacy in the Wild 35


Module 5: Privacy in the real world
• Real world deployments of differential privacy

– OnTheMap RAPPOR

• Privacy beyond Tabular Data


– No Free Lunch Theorem
– Customizing differential privacy using Pufferfish

Module 5 Tutorial: Differential Privacy in the Wild 36


http://onthemap.ces.census.gov/

Module 5 Tutorial: Differential Privacy in the Wild 37


Data underlying OnTheMap
• Employee • Job
– Age – Start date
– Sex – End date
– Race & Ethnicity – Worker & Workplace IDs
– Education – Earnings
– Home location (Census block)

• Employer
– Geography (Census blocks)
– Industry
– Ownership (Public vs Private)
Module 5 Tutorial: Differential Privacy in the Wild 38
Why release such data?

Module 5 Tutorial: Differential Privacy in the Wild 39


Why is privacy needed?

US Code: Title 13 CENSUS


It is against the law to make any publication whereby
the data furnished by any particular establishment or
individual under this title can be identified.

Violating the statutory confidentiality pledge can result


in fines of up to $250,000 and potential imprisonment
for up to five years.

Module 5 Tutorial: Differential Privacy in the Wild 40


OnTheMap
Residence = Origin (Sensitive)
Workplace = Destination (Quasi-identifier)
Origins and destinations are Census blocks.

Worker ID   Origin     Destination
1223        MD11511    DC22122
1332        MD2123     DC22122
1432        VA11211    DC22122
2345        PA12121    DC24132
1432        PA11122    DC24132
1665        MD1121     DC24132
1244        DC22122    DC22122

Module 5 Tutorial: Differential Privacy in the Wild 41


Current approach: Synthetic Database
• Sanitize the dataset one time
• Analyst can perform arbitrary computations on
the synthetic datasets

• Unlike in query answering systems


– No need to maintain state (of queries asked)
– No need to track privacy loss across queries or
across analysts

Module 5 Tutorial: Differential Privacy in the Wild 42


Synthetic Residence Generator
(circa 2007)
Origin     Destination   # Workers
MD1151     DC22122        1
MD2123     DC22122        3
VA11211    DC22122       12        + noise
PA12121    DC24132       43        (Dirichlet resampling)
PA11122    DC24132        5
MD1121     DC24132        2
DC22122    DC22122        1

No noise is added to origin-destination pairs with true count 0


Can lead to re-identification attacks.
Module 5 Tutorial: Differential Privacy in the Wild 43
Differentially Private Synthetic Data
Generator
• Noise added to all origin-destination (o-d) pairs
– Even if 0 count in the original dataset

• Noise calibrated to ensure a variant called


probabilistic differential privacy

• Utility ensured by coarsening the domain and


probabilistically dropping o-d pairs with no
support.
Module 5 Tutorial: Differential Privacy in the Wild 44
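As a rough illustration of the idea (not the exact OnTheMap algorithm), the sketch below shows Dirichlet resampling for a single destination block: a prior pseudo-count alpha is added to every candidate origin, including those with true count 0 (as the differentially private variant requires), a probability vector is drawn from the resulting Dirichlet, and synthetic origins are sampled from it. The function name and the calibration of alpha are assumptions made for illustration.

    import numpy as np

    def synthesize_origins(origin_counts, alpha, n_synthetic):
        """Dirichlet-resampling sketch for one destination block.
        origin_counts: true worker counts for every candidate origin block
        alpha: prior pseudo-count added to every origin (including zeros)"""
        counts = np.asarray(origin_counts, dtype=float)
        probs = np.random.dirichlet(counts + alpha)        # noisy o-d distribution
        return np.random.multinomial(n_synthetic, probs)   # synthetic origin counts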
Evaluation
• Utility measured by average commute distance for each
destination block.

Experimental Setup:
• OTM: Currently published
OnTheMap data used as original data.
• All destinations in Minnesota.
• 120,690 origins per destination,
  chosen by pruning out blocks that are more
  than 100 miles from the destination.

• Total ε = 8.3, δ = 10⁻⁵

Module 5 Tutorial: Differential Privacy in the Wild 45


Module 5: Privacy in the real world
• Real world deployments of differential privacy

– OnTheMap RAPPOR

• Privacy beyond Tabular Data


– No Free Lunch Theorem
– Customizing differential privacy using Pufferfish

Module 5 Tutorial: Differential Privacy in the Wild 46


A dilemma
• Cloud services want to protect their users,
clients and the service itself from abuse.

• Need to monitor statistics of, for instance,


browser configurations.
– Did a large number of users have their home page
redirected to a malicious page in the last few hours?

• But users do not want to give up their data


Module 5 Tutorial: Differential Privacy in the Wild 47
Browser configurations can identify users

Module 5 Tutorial: Differential Privacy in the Wild 48


Problem [Erlingsson et al CCS’14]

What are the frequent unexpected


Chrome homepage domains?

→ To learn about malicious software
that changes Chrome settings
without users’ consent
.
.
.

Finance.com WeirdStuff.com
Fashion.com

Module 5 Tutorial: Differential Privacy in the Wild 49


Why is privacy needed?
Liability (for server)
Storing unperturbed sensitive
data makes server
accountable (breaches,
subpoenas, privacy policy
violations)
.
.
.

Finance.com
WeirdStuff.com
Fashion.com

Module 5 Tutorial: Differential Privacy in the Wild 50


Solution
Can use Randomized Response …

On a binary domain:
With probability p report true value
With probability 1-p report false value

… but the domain of all urls is very large …


… original value is reported with very low prob.

Module 5 Tutorial: Differential Privacy in the Wild 51
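A minimal sketch of this binary randomized response, together with the corresponding unbiased frequency estimate (assuming p > 1/2; function names are illustrative):

    import numpy as np

    def randomized_response(bit, p):
        # report the true bit with probability p, the flipped bit otherwise
        return bit if np.random.rand() < p else 1 - bit

    def estimate_frequency(reports, p):
        # E[observed] = f*p + (1-f)*(1-p)  =>  f = (observed - (1-p)) / (2p - 1)
        observed = np.mean(reports)
        return (observed - (1 - p)) / (2 * p - 1)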


RAPPOR Solution
• Idea 1: Use Bloom filters to reduce the domain size

[Figure 1: Life of a RAPPOR report, for participant 8456 in cohort 1 whose true value
is Finance.com (illustrated with the string “The number 68”). The client value is hashed
onto a 256-bit Bloom filter B using h (here 4) hash functions, setting 4 signal bits.
A Permanent randomized response B′ (69 bits on) is produced and memoized by the client,
and this B′ is used (and reused in the future) to generate the Instantaneous randomized
responses sent to the server (145 bits on).]

Module 5 Tutorial: Differential Privacy in the Wild 52
RAPPOR Solution
• Idea 2: Use randomized response on the Bloom filter bits

[Same RAPPOR report figure as on the previous slide: the Bloom filter B (4 signal bits)
is turned into the memoized Permanent randomized response B′ (69 bits on).]

Module 5 Tutorial: Differential Privacy in the Wild 53
RAPPOR Solution
• Idea 3: Use randomized response again, on the fake Bloom filter

Why randomize two times?
– Chrome collects information each day
– Want perturbed values to look different on
  different days to avoid linking

[Same RAPPOR report figure as before: each day the memoized Permanent randomized
response B′ is used to generate a fresh Instantaneous randomized response (145 bits on),
which is the report actually sent to the server.]

Module 5 Tutorial: Differential Privacy in the Wild 54
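Taken together, the three ideas give a simple client-side encoder. Below is a minimal Python sketch of that pipeline; the 256-bit filter, the salted SHA-256 hashing, and the parameter values f = 0.5, p = 0.5, q = 0.75 are illustrative stand-ins, not the exact RAPPOR implementation.

    import hashlib
    import numpy as np

    def bloom_bits(value, num_bits=256, num_hashes=4, cohort=1):
        # Idea 1: hash the string onto a small Bloom filter
        bits = np.zeros(num_bits, dtype=int)
        for h in range(num_hashes):
            digest = hashlib.sha256(f"{cohort}:{h}:{value}".encode()).hexdigest()
            bits[int(digest, 16) % num_bits] = 1
        return bits

    def permanent_rr(bits, f=0.5):
        # Idea 2: memoized randomized response on every Bloom-filter bit
        r = np.random.rand(len(bits))
        return np.where(r < f / 2, 1, np.where(r < f, 0, bits))

    def instantaneous_rr(fake_bits, p=0.5, q=0.75):
        # Idea 3: a fresh randomization each day, so daily reports cannot be linked
        probs = np.where(fake_bits == 1, q, p)
        return (np.random.rand(len(fake_bits)) < probs).astype(int)

    # one daily report for one client:
    # report = instantaneous_rr(permanent_rr(bloom_bits("Finance.com")))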
Server Report Decoding
• Step 5: estimate the bit frequencies f̂(𝐷) from the clients’ reports
• Step 6: estimate the frequencies of candidate strings
  (Finance.com, Fashion.com, WeirdStuff.com, …) by regression from f̂(𝐷)

[Figure: the server stacks the clients’ randomized bit reports, sums each
column to obtain noisy bit counts, and maps them back to candidate strings.]
Module 5 Tutorial: Differential Privacy in the Wild 55
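For Step 5, the two layers of randomized response can be inverted in closed form. The sketch below assumes the same illustrative parameters (f, p, q) as the client sketch above; Step 6, mapping the estimated bit frequencies back to candidate strings, is a separate regression (lasso) problem and is not shown.

    import numpy as np

    def estimate_bit_frequencies(reports, f=0.5, p=0.5, q=0.75):
        """Step 5 sketch: invert the two randomized-response layers to get an
        unbiased estimate of how often each Bloom-filter bit is truly set.
        reports: (num_clients x num_bits) 0/1 array of daily reports."""
        q_star = q * (1 - f / 2) + p * (f / 2)   # P(report bit = 1 | true bit = 1)
        p_star = q * (f / 2) + p * (1 - f / 2)   # P(report bit = 1 | true bit = 0)
        observed = reports.mean(axis=0)
        return (observed - p_star) / (q_star - p_star)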
Evaluation
http://google.github.io/rappor/examples/report.html

Module 5 Tutorial: Differential Privacy in the Wild 56


Other Real World Deployments
• Differentially private password frequency lists [Blocki et al. NDSS ‘16]
– released a corpus of 50 password frequency lists representing
approximately 70 million Yahoo! users
– ε varies from 8 to 0.002
• Human Mobility [Mir et al. Big Data ’13 ]
– synthetic data to estimate commute patterns from call detail records
collected by AT&T
– 1 billion records ~ 250,000 phones
• Apple will use DP [Greenberg. Wired Magazine ’16]
– in iOS 10 to collect data to improve QuickType and emoji suggestions,
Spotlight deep link suggestions, and Lookup Hints in Notes
– in macOS Sierra to improve autocorrect suggestions and Lookup Hints

Module 5 Tutorial: Differential Privacy in the Wild 57


Module 5: Privacy in the real world
• Real world deployments of differential privacy

– OnTheMap RAPPOR

• Privacy beyond Tabular Data


– No Free Lunch Theorem
– Customizing differential privacy using Pufferfish

Module 5 Tutorial: Differential Privacy in the Wild 58


Differential Privacy & Complex Datatypes

• Defining neighboring databases


– What is a record?

• Records can be correlated


– Unravels privacy guarantee

Module 5 Tutorial: Differential Privacy in the Wild 59


Graphs
Neighboring databases … differ in one record.

• In graphs, a record can be:


– An edge (u,v)
– The adjacency list of node u

Module 5 Tutorial: Differential Privacy in the Wild 60


Trajectories
Neighboring databases … differ in one record.

• In location trajectories, a record can be:


– Each location in the trajectory
– A sequence of locations spanning a window of time
– The entire trajectory

Module 5 Tutorial: Differential Privacy in the Wild 61


US Census Bureau Data
• Employee • Job
– Age – Start date
– Sex – End date
– Race & Ethnicity – Worker & Workplace IDs
– Education – Earnings
– Home location (Census block)

• Employer
– Geography (Census blocks)
– Industry
– Ownership (Public vs Private)
Module 5 Tutorial: Differential Privacy in the Wild 62
US Census Bureau Data
Neighboring databases … differ in one record.
[Screenshots: existing LODES data as presented in the OnTheMap tool —
employment in Lower Manhattan and residences of Lower Manhattan workers.
Available at http://onthemap.ces.census.gov/]

• A record can be:
  – An employee
  – An employer
  – Something else?
• Come to the talk on Thursday:
  Samuel Haney, “The Cost of Provable Privacy: A Case Study on …”

Module 5 Tutorial: Differential Privacy in the Wild 63


Differential Privacy & Complex Datatypes

• Defining neighboring databases


– What is a record?

• Records can be correlated


– Unravels privacy guarantee

Module 5 Tutorial: Differential Privacy in the Wild 64


[KM11]

Correlations and DP

Bob Alice

• Want to release the number of edges between blue


and green communities.
• Should not disclose the presence/absence of Bob-
Alice edge.

Module 5 Tutorial: Differential Privacy in the Wild 65


[KM11]
Adversary knows how social networks evolve

Depending on the social network evolution model,
(d2 − d1) is linear or even super-linear in the size of the network,
where d1 and d2 are the cross-community edge counts in the two worlds
(Bob–Alice edge absent vs. present), once the correlated edges implied
by the evolution model are taken into account.

Module 5 Tutorial: Differential Privacy in the Wild 66


[KM11]

Differential privacy fails to avoid breach

Output (d1 + δ)

δ ~ Laplace(1/ε)

Output (d2 + δ)

Adversary can distinguish between the two


worlds if d2 – d1 is large.

Module 5 Tutorial: Differential Privacy in the Wild 67


[KM11]

Reason for Privacy Breach


• Pairs of tables that differ
in one tuple
• The adversary cannot distinguish between them

Tables that do not


satisfy background
knowledge

Space of all
possible tables

Module 5 Tutorial: Differential Privacy in the Wild 68


[KM11]

Reason for Privacy Breach


The adversary can distinguish between
every pair of these tables based
on the output

Space of all
possible tables

Module 5 Tutorial: Differential Privacy in the Wild 69


No Free Lunch Theorem
It is not possible to guarantee any utility in addition
to privacy, without making assumptions about

• the data generating distribution [KM11]

• the background knowledge available
  to an adversary [DN10]

Module 5 Tutorial: Differential Privacy in the Wild 70


Need a formal theory to understand the
privacy ensured by DP

Module 5 Tutorial: Differential Privacy in the Wild 71


[KM12, KM14]

Pufferfish

Module 5 Tutorial: Differential Privacy in the Wild 72


Pufferfish Semantics
• What is being kept secret?

• Who are the adversaries?

• How is information disclosure bounded?


– (similar to epsilon in differential privacy)

Module 5 Tutorial: Differential Privacy in the Wild 73


Sensitive Information
• Secrets: S be a set of potentially sensitive statements
– “individual j’s record is in the data, and j has Cancer”
– “individual j’s record is not in the data”

• Discriminative Pairs:
Mutually exclusive pairs of secrets.
– (“Bob is in the table”, “Bob is not in the table”)
– (“Bob has cancer”, “Bob has diabetes”)

– Denotes an adversary’s possible beliefs about a target individual.

Module 5 Tutorial: Differential Privacy in the Wild 74


Adversaries
• We assume a Bayesian adversary who can be completely
characterized by his/her prior information about the data
– We do not assume computational limits

• Data Evolution Scenarios: set of all probability distributions


that could have generated the data ( … think adversary’s prior).

– No assumptions: All probability distributions over data instances are


possible.

– I.I.D.: Set of all f such that: P(data = {r1, r2, …, rk}) = f(r1) x f(r2)
x…x f(rk)

Module 5 Tutorial: Differential Privacy in the Wild 75


Information Disclosure
• Mechanism M satisfies ε-Pufferfish(S, Spairs, D) if, for every output o,
  every discriminative pair (s, s′) ∈ Spairs, and every data-evolution
  scenario θ ∈ D under which both s and s′ have non-zero probability:

  P( M(Data) = o | s, θ )  ≤  e^ε · P( M(Data) = o | s′, θ )

  (and the symmetric inequality with s and s′ swapped)

Module 5 Tutorial: Differential Privacy in the Wild 76


Pufferfish Semantic Guarantee

P(s | M(Data) = o, θ) / P(s′ | M(Data) = o, θ)   ≤   e^ε · P(s | θ) / P(s′ | θ)

       Posterior odds of s vs s′                        Prior odds of s vs s′

Module 5 Tutorial: Differential Privacy in the Wild 77


Customizing Privacy
• Setup secrets and discriminative pairs based on
the requirements of what must be kept secret

• Set up data generating distributions to capture


correlations known to the adversary

• Pufferfish results in privacy definition that


bounds the adversary’s posterior and prior odds
for every discriminative pair.
Module 5 Tutorial: Differential Privacy in the Wild 78
Advantages
• Privacy defined more generally in terms of
customizable secrets rather than records
– Better capture legal privacy policies

• Can better explore privacy-utility tradeoff by


varying secrets and adversaries
– See application to US Census Bureau Data
(Thursday 2PM DP Session)

• Gives a deeper understanding of the protections


afforded by existing privacy definition
Module 5 Tutorial: Differential Privacy in the Wild 79
Pufferfish & Differential Privacy

• Discriminative Pairs:
  – sᵢ,ₓ : record i takes the value x
  – sᵢ,⊥ : record i is not in the database
  – Spairs = { (sᵢ,ₓ , sᵢ,⊥) : for all individuals i, for all values x in the domain }

• Attackers should not be able to tell whether a


record is in or out of the database

Module 5 Tutorial: Differential Privacy in the Wild 80


Pufferfish & Differential Privacy

• Data evolution:
– For all θ = [ f1, f2, f3, …, fk ]

  P( Data = D | θ ) = ∏_{rᵢ ∈ D} fᵢ(rᵢ)

• Adversary’s prior may be any distribution that


makes records independent
Module 5 Tutorial: Differential Privacy in the Wild 81
Pufferfish & Differential Privacy
• Discriminative Pairs:
– Spairs = { (sᵢ,ₓ , sᵢ,⊥) : for all individuals i, for all values x in the domain }

• Data evolution:
– For all θ = [ f1, f2, f3, …, fk ]

  P( Data = D | θ ) = ∏_{rᵢ ∈ D} fᵢ(rᵢ)
A mechanism M satisfies differential privacy
if and only if
it satisfies Pufferfish instantiated using Spairs and {θ}

Module 5 Tutorial: Differential Privacy in the Wild 82


Challenges with Pufferfish
• Setting up data generating distributions are tricky
– Adversary’s knowledge is unknown

• Little work on algorithm design for Pufferfish


– Notable Exceptions: Blowfish (next module), and
Wasserstein mechanism (Thursday 2 PM DP Session)

• Not all Pufferfish definitions are “good”


– Many do not satisfy composition

Module 5 Tutorial: Differential Privacy in the Wild 83


Summary
• Complex datatypes require custom privacy definitions
– No Free Lunch theorem
– Varied notions of neighboring databases
– Correlations can unravel privacy ensured by DP algorithms

• Pufferfish is a mathematical framework for defining


privacy
– A rigorous way to customize privacy to applications
– Helps understand semantics of privacy definitions

Module 5 Tutorial: Differential Privacy in the Wild 84


MODULE 6:
APPLICATIONS II: NETWORK &
TRAJECTORIES
Module 6 Tutorial: Differential Privacy in the Wild 85
Module 6: Applications II
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectories

• Blowfish Privacy

Module 6 Tutorial: Differential Privacy in the Wild 86


Pufferfish Semantics
• What is being kept secret?

• Who are the adversaries?

• How is information disclosure bounded?


– (similar to epsilon in differential privacy)

Module 6 Tutorial: Differential Privacy in the Wild 87


Examples: Graphs
• Neighboring graphs differ in presence/absence of one edge

• Pufferfish meaning:
– Data: matrix of bits
– Secrets: whether or not an edge (𝑢, 𝑣) is in the graph -- bit at
(𝑢, 𝑣) is 0 or 1
– Data generating distributions: All graphs where each edge 𝒆 is
independently present with probability 𝑝𝑒.

• But …
– Edges are not independent in real graphs

Module 6 Tutorial: Differential Privacy in the Wild 88


Examples: Location Trajectories
• Neighboring tables differ in one location (at one point of
time) of an individual

• Pufferfish meaning
– Data: a matrix of locations
– Secrets: Whether or not individual was at some location at some
point of time
– Data Generating Distributions: All trajectories where an
individual’s location at some time is independent of all other
locations …

• But …
– Current location depends on previous locations…

Module 6 Tutorial: Differential Privacy in the Wild 89


Common Themes
• What are secrets and neighboring datasets for different
applications?

• Correlations between protected objects requires further


redefinition of privacy
• New privacy definitions requires new algorithm design
• Many open questions

Module 6 Tutorial: Differential Privacy in the Wild 90


Social Network
• Represented using a graph 𝐺(𝑉, 𝐸)
– 𝑉: node set (individuals)
– 𝐸: edge set (social links)

Module 6 Tutorial: Differential Privacy in the Wild 91


Social Network
• Attacks on graph anonymization
“it is possible for an adversary to learn whether edges
exist or not between specific targeted pairs of nodes.”
[BDK07]

“a third of the users on both Twitter and Flickr, can


be re-identified in the anonymous Twitter graph
with only a 12% error rate.” [NS09]

Module 6 Tutorial: Differential Privacy in the Wild 92


Private Analysis of Social Network
Trusted Server

𝑓𝑝𝑟𝑖𝑣𝑎𝑡𝑒(𝐺, 𝜖)

Person 1 Person 2 Person 3 Person N


v1 v2 v3 vN

Differential Privacy: f_private is ε-differentially private if,
for all neighboring graphs G, G′ and every output set S:
Pr[ f_private(G, ε) ∈ S ] ≤ e^ε · Pr[ f_private(G′, ε) ∈ S ]

Module 6 Tutorial: Differential Privacy in the Wild 93


Variants of DP for Social Network
• Edge Differential Privacy
Secret: social links between individuals

Two graphs are neighbors if they differ in the


presence of one edge
Module 6 Tutorial: Differential Privacy in the Wild 94
Variants of DP for Social Network
• Node Differential Privacy
Secret: presence of an individual

Two graphs are neighbors if one can be obtained by


another by adding or removing a node and all its edges
Module 6 Tutorial: Differential Privacy in the Wild 95
Examples for Social Network Statistics
• Degree distribution 𝐷(𝐺)
• Number of edges
• Counts of small subgraphs
e.g triangles, 𝑘-triangles, 𝑘-stars, etc.
• Cut
• Distance to nearest graph with a certain
property
• Joint degree distribution
Module 6 Tutorial: Differential Privacy in the Wild 96
Examples for Social Network Statistics
• Degree distribution 𝐷(𝐺)

Degree 0 1 2 3 4 5
Frequency 0 5 0 0 0 1

𝐷(𝐺) = [0,5,0,0,0,1]

Module 6 Tutorial: Differential Privacy in the Wild 97


Global Sensitivity of Degree
Distribution
• What is the global sensitivity of the degree
distribution of 𝐺 𝑉, 𝐸 , where 𝑉 = 𝑛 under
Edge differential privacy?

Module 6 Tutorial: Differential Privacy in the Wild 98


Global Sensitivity of Degree
Distribution
• What is the global sensitivity of the degree
distribution of 𝐺 𝑉, 𝐸 , where 𝑉 = 𝑛 under
Edge differential privacy?

Answer: 4
Remove an edge (𝑖, 𝑗), where node 𝑖 has degree 𝑑ᵢ and node 𝑗 has degree 𝑑ⱼ;
the degree frequencies change in four cells, each by 1:
Degree:            …   𝑑ᵢ−1   𝑑ᵢ   …   𝑑ⱼ−1   𝑑ⱼ   …
Frequency change:  …    +1    −1   …    +1    −1   …

Module 6 Tutorial: Differential Privacy in the Wild 99
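Since the global sensitivity of the degree histogram under edge DP is 4, the Laplace mechanism applies directly. A minimal sketch (assuming degrees are given as non-negative integers; the function name is illustrative):

    import numpy as np

    def private_degree_distribution(degrees, max_degree, epsilon):
        # adding/removing one edge changes two node degrees, hence at most
        # four histogram cells by 1 each: L1 sensitivity 4 under edge DP
        hist = np.bincount(degrees, minlength=max_degree + 1).astype(float)
        return hist + np.random.laplace(scale=4.0 / epsilon, size=hist.shape)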


Global Sensitivity of Degree
Distribution (Exercise)
• What is the global sensitivity of the degree
distribution of 𝐺 𝑉, 𝐸 , where 𝑉 = 𝑛 under
Node differential privacy?

Module 6 Tutorial: Differential Privacy in the Wild 100


Global Sensitivity of Degree
Distribution (Exercise)
• What is the global sensitivity of the degree
distribution of 𝐺 𝑉, 𝐸 , where 𝑉 = 𝑛 under
Node differential privacy? Answer: 2n-1
Highly sensitive! → Too much noise

𝐷(𝐺) = [0,5,0,0,0,1] 𝐷(𝐺′) = [5,0,0,0,0,0]

Module 6 Tutorial: Differential Privacy in the Wild 101


Approach to Highly Sensitive Queries
Key idea:
• Project 𝐺 onto a 𝜃-degree-bounded graph 𝐺_𝜃
• Answer queries on 𝐺_𝜃 instead of 𝐺:
  D̃(𝐺) = 𝐷(𝐺_𝜃) + noise

• Existing approaches for degree distribution


– Node-based truncation [KNRS13]
– Lipschitz extensions [RS15]
– Edge-based projection [DLL16]

Module 6 Tutorial: Differential Privacy in the Wild 102


How much noise?
• Answer queries on 𝐺_𝜃 instead of 𝐺:
  D̃(𝐺) = 𝐷(𝐺_𝜃) + noise
• Sensitivity
  – Node-based truncation: 2𝜃 ⋅ 𝛿
    (smooth sensitivity approach [NRS07], also applicable to counts of
    edges and small subgraphs)
  – Lipschitz extensions: 6𝜃 [KNRS13]
  – Edge-based projection: 2𝜃 + 1 [DLL16]

Module 6 Tutorial: Differential Privacy in the Wild 103
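A minimal sketch of the projection idea, using a naive truncation of each adjacency list to θ neighbours as a stand-in for the more careful projections cited above; the caller must supply the sensitivity that matches the projection actually used, and the function name is illustrative.

    import numpy as np

    def theta_bounded_degree_hist(adj_list, theta, epsilon, sensitivity):
        # naive projection: cap every node's degree at theta, then release a
        # noisy degree histogram; `sensitivity` must match the chosen projection
        degrees = [min(len(neigh), theta) for neigh in adj_list]
        hist = np.bincount(degrees, minlength=theta + 1).astype(float)
        return hist + np.random.laplace(scale=sensitivity / epsilon, size=hist.shape)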


Work on Edge DP
• Degree distribution
– Global sensitivity + Post-processing [HLM09, HRMS10,
KS12, LK13]
• Small subgraph counting
– Smooth sensitivity [NRS07]
– Ladder function [ZCPSX15]
– Noisy sensitivity [DL09]
• Cut
– Random projections, global sensitivity [BBDS12]
– Iterative updates [HR10, GRU12]
• Releasing differentially private graph
– Exponential random graphs [LM14, KSK15]

Module 6 Tutorial: Differential Privacy in the Wild 104


Outline of Module 6
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectory

• Blowfish Privacy

Module 6 Tutorial: Differential Privacy in the Wild 105


Location Trajectory

High uniqueness & High predictability


[MHVB13] [SQBB10]

‘show me how you move and I will tell you who you are’
[GKC10]

‘geosocial service “check in” dropped from 18% to 12%’


in the Pew Research Center’s Internet Project, 2013

Module 6 Tutorial: Differential Privacy in the Wild 106


Rich Domain for Secrets
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 107


Rich Domain for Secrets
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 108


Rich Domain for Secrets
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 109


Rich Domain for Secrets
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 110


Overview of Privacy Definitions
Neighbors differ What to hide?
in
Trajectory All properties of the individual are secret
e.g. Where is home location?
Window Properties within a small window
e.g. Did user visit home in the last hour?
Event Properties at a specific time
e.g. Was user at work or at home at time t?
Geo- Some properties (not all) at a specific time
indistinguishability e.g. Did user visit near home at time t ?

Module 6 Tutorial: Differential Privacy in the Wild 111


Protect a Single Location
• Protect a single location
  e.g. Location-based Services (LBS) to find a restaurant
• Do not reveal the exact location
• Revealing an approximate location is ok

[Diagram: true location 𝑥, nearby location 𝑥′, within radius 𝑟]

• A mechanism satisfies 𝝐-geo-indistinguishability iff for all
  observations 𝑆 ⊆ 𝑍, for all 𝑟 > 0, and for all neighbors 𝑥, 𝑥′
  with 𝑑(𝑥, 𝑥′) ≤ 𝑟:   [ABCP13]
  Pr[ 𝑆 | 𝑥 ] ≤ 𝑒^(𝜖𝑟) · Pr[ 𝑆 | 𝑥′ ]

Module 6 Tutorial: Differential Privacy in the Wild 112
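One common mechanism for ε-geo-indistinguishability perturbs the true point with planar Laplace noise, whose density decays as exp(−ε · distance); in polar coordinates the angle is uniform and the radius is Gamma(2, 1/ε)-distributed. A minimal sketch, assuming planar coordinates and ignoring the discretization and truncation steps used in practice:

    import numpy as np

    def planar_laplace(location, epsilon):
        # 2D noise with density proportional to exp(-epsilon * distance)
        x, y = location
        angle = np.random.uniform(0.0, 2.0 * np.pi)
        radius = np.random.gamma(shape=2.0, scale=1.0 / epsilon)
        return x + radius * np.cos(angle), y + radius * np.sin(angle)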


Different Levels of Protection
Total budget: 𝜖𝑡
• Event level DP

Privacy Budget 𝜖 𝜖 𝜖 …. 𝜖 𝜖

Released locations z1 z2 z3 …. zt-1 zt


True locations x1 x2 x3 …. xt-1 xt
1 2 3 …. t-1 t time

If a person stays at the same location for a long time, 𝑥₁ = 𝑥₂ =
⋯ = 𝑥_w, then averaging the released locations (𝑧₁, …, 𝑧_w) leaks the true location.

Module 6 Tutorial: Differential Privacy in the Wild 113


Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– Neighboring stream prefixes (𝑥₁, 𝑥₂, …, 𝑥_t) and (𝑥′₁, 𝑥′₂, …, 𝑥′_t):
  • each 𝑥ᵢ and 𝑥′ᵢ are the same or neighboring
  • for any 𝑖 < 𝑗, if 𝑥ᵢ ≠ 𝑥′ᵢ and 𝑥ⱼ ≠ 𝑥′ⱼ, then 𝑗 − 𝑖 + 1 ≤ 𝑤
→ Protect all updates happening within any window of 𝑤 events with privacy
budget 𝜖

x1 x2 xi …. xj …. xt
x'1 x’2 x’i …. x’j …. x’t
≤𝑤 time
Module 6 Tutorial: Differential Privacy in the Wild 114
Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– E.g. 𝑤=3
𝜖

Released locations z1 z2 z3 z4 …. zt-1 zt


True locations x1 x2 x3 x4 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 115


Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– E.g. 𝑤=3
𝜖

Released locations z1 z2 z3 z4 …. zt-1 zt


True locations x1 x2 x3 x4 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 116


Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– Allows a budget allocation strategy:
  • Adaptively assign per-event privacy budgets 𝜖₁, 𝜖₂, 𝜖₃, 𝜖₄, … to the
    events within the same 𝑤-window, so that they sum to at most 𝜖
  • E.g. 𝑤 = 3
Released locations z1 z2 z3 z4 …. zt-1 zt
True locations x1 x2 x3 x4 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 117


Different Levels of Protection
[KPXP14]

• 𝑤-event DP
– Allows a budget allocation strategy:
  • Adaptively assign privacy budgets to events within the same 𝑤-window
  • E.g. 𝑤 = 3: the per-event budgets assigned inside the current
    window sum to 5𝜖/6 ≤ 𝜖
Released locations z1 z2 z3 z4 …. zt-1 zt
True locations x1 x2 x3 x4 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 118
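A small helper that checks whether a proposed per-timestep budget sequence respects the w-event constraint; this is illustrative only and not an algorithm from [KPXP14].

    def respects_w_event_budget(budgets, w, total_eps):
        # every window of w consecutive per-timestep budgets must sum to <= total_eps
        return all(sum(budgets[i:i + w]) <= total_eps
                   for i in range(len(budgets) - w + 1))

    # e.g. respects_w_event_budget([1/3, 1/4, 1/4, 1/4], w=3, total_eps=1.0)  ->  True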


Different Levels of Protection
• Trajectory-level DP for entire trajectory
– Neighboring databases 𝐷1, 𝐷2
• Differ in one entire trajectory
– Release aggregate statistics for multiple trajectories
[CAC12, HCMPS15]
Privacy Budget 𝜖

Released locations z1 z2 z3 …. zt-1 zt


True locations x1 x2 x3 …. xt-1 xt
1 2 3 …. t-1 t time

Module 6 Tutorial: Differential Privacy in the Wild 119


Different Level of Privacy Protection
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectory

• Edge DP • 𝜖-indistinguishability
• Node DP • Event DP
• 𝑤-event DP
• Trajectory level DP

Module 6 Tutorial: Differential Privacy in the Wild 120


Outline of Module 6
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectory

• Blowfish Privacy

Module 6 Tutorial: Differential Privacy in the Wild 121


Blowfish Privacy [HMD14]

• Special case of Pufferfish that satisfies sequential


composition

• A framework for redefining neighboring databases for


complex datatypes using a policy graph
– Captures many neighboring definitions
– Handles correlations induced by constraints on database
• Prior data releases
• Location constraints

Module 6 Tutorial: Differential Privacy in the Wild 122


Blowfish
• Differential Privacy:
For all outputs 𝑜, for all 𝐷₁, 𝐷₂ with |𝐷₁ − 𝐷₂| = 1:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^𝜖 · Pr[ 𝐴(𝐷₂) = 𝑜 ]

• Blowfish Privacy (redefined neighbor relation 𝑵_𝑮):
For all outputs 𝑜, for all (𝐷₁, 𝐷₂) ∈ 𝑵_𝑮:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^𝜖 · Pr[ 𝐴(𝐷₂) = 𝑜 ]

Module 6 Tutorial: Differential Privacy in the Wild 123


Algorithm Design Simplified
[HMD16]

• Transformational equivalence between Blowfish and


differential privacy

• No need to do algorithm design from scratch for each


definition

• Answering queries under a Blowfish privacy policy is


equivalent in error to answering transformed queries
under differential privacy

Module 6 Tutorial: Differential Privacy in the Wild 124


Intuition
For all outputs 𝑜, for all 𝐷₁, 𝐷₂ with |𝐷₁ − 𝐷₂| = 1:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^𝜖 · Pr[ 𝐴(𝐷₂) = 𝑜 ]

is equivalent to

For all outputs 𝑜, for all 𝐷₁, 𝐷₂:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^(𝜖·Δ(𝐷₁,𝐷₂)) · Pr[ 𝐴(𝐷₂) = 𝑜 ]
where Δ(𝐷₁, 𝐷₂) is the size of the symmetric difference

Module 6 Tutorial: Differential Privacy in the Wild 125


Definitions differ in distance metrics
• Differential Privacy:
For all outputs 𝑜, for all 𝐷₁, 𝐷₂:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^(𝜖·Δ(𝐷₁,𝐷₂)) · Pr[ 𝐴(𝐷₂) = 𝑜 ]

• Blowfish Privacy (distance metric 𝑑_𝐺 imposed by the neighbor relation):
For all outputs 𝑜, for all 𝐷₁, 𝐷₂:
Pr[ 𝐴(𝐷₁) = 𝑜 ] ≤ 𝑒^(𝜖·𝑑_𝐺(𝐷₁,𝐷₂)) · Pr[ 𝐴(𝐷₂) = 𝑜 ]

Module 6 Tutorial: Differential Privacy in the Wild 126


Transformational equivalence …
… achieved by embedding the distance metric imposed by the Blowfish
neighbor definition into the distance metric imposed by neighbors
that differ in one record.

Module 6 Tutorial: Differential Privacy in the Wild 127


Extending Differential Privacy via
Metrics
• [CEBP13] propose generalizations of differential
privacy using metrics
– Special case of Pufferfish and generalizes Blowfish

• [WSC17] use a similar intuition to derive a generalized


sensitivity notion for using Laplace mechanism for
Pufferfish Thur 2pm DP Session
– Based on Wasserstein distances
– Computing this generalized sensitivity can be intractable
– Examples of intractability also shown in [KM11, HMD14]

Module 6 Tutorial: Differential Privacy in the Wild 128


Module 6: Applications II
• Pufferfish Privacy for Non-tabular Data
Social network Location trajectories

• Blowfish Privacy

Module 6 Tutorial: Differential Privacy in the Wild 129


Open Questions
• Identify realistic policies for real world applications.
– Is it socially acceptable to offer weaker privacy protection to
high-degree nodes?
• Algorithm design under complicated constraints or
correlations.
– Correlations within both trajectories and between users, e.g.
family members may share similar trajectories patterns.
– Highly sensitive queries under constraints or correlations.
• Privacy analysis across different guarantees.

Module 6 Tutorial: Differential Privacy in the Wild 130


MODULE 7:
SUMMARY

Tutorial: Differential Privacy in the Wild 131


Module 7: Summary
• Recap of tutorial
• Five Cross-cutting ideas
• Challenges

Tutorial: Differential Privacy in the Wild 132


Statistical database privacy
• Statistical database privacy is the problem of
releasing aggregates while not disclosing individual
records
• Privacy desiderata
– Resilience to background knowledge
– Composition
– Avoid privacy by obscurity: public
algorithms/implementations
• Utility desiderata
– Accurate
– Useful

Tutorial: Differential Privacy in the Wild 133


Tutorial Summary
• Applications
– Query answering
– Machine learning
– Analysis of network data
– Trajectories
• Real-world deployments:
– U.S. Census Bureau OnTheMap: commuting patterns
– Google RAPPOR: browser settings
• Formal privacy definitions
– Differential privacy, Pufferfish, Blowfish

Tutorial: Differential Privacy in the Wild 134


Cross-cutting ideas
1. Higher accuracy through careful composition
– Parallel composition, advanced composition

2. Where to inject noise?


– On input, output, intermediate result
– Find “information bottleneck” that has tight bound
on sensitivity
– May be dictated by application (e.g., RAPPOR)

Tutorial: Differential Privacy in the Wild 135


Cross-cutting ideas
3. Lossy transformations
– Histograms: adaptive binning
– RAPPOR: bloom filters
– Social networks: degree-bounded graphs
– … results in bias-variance tradeoffs

4. Leverage domain knowledge


– OnTheMap: previously published data
– RAPPOR: heavy hitters
– Network data: tends to be sparse
– Histograms: smooth, sparse

Tutorial: Differential Privacy in the Wild 136


Cross-cutting ideas
5. Privacy definition may be application specific
– Differential privacy is a rigorous definition that
protects individual tuples…
– … but this may not align with semantics of
application
– In your application…
• What are the secrets?
• Who are the adversaries? What data correlations can
they exploit?

Tutorial: Differential Privacy in the Wild 137


Challenge 1: From Prototypes to
Deployments
• Community needs more examples of real-world
deployments

[Erlingsson et al, CCS 2014]

• Demonstrate usefulness in real applications


• These raise important research problems
– Hardening against side-channel attacks [M12]
– Matching formal privacy guarantee to needs of
application
Tutorial: Differential Privacy in the Wild 138
Challenge 2: From Algorithms to
Systems
• Today, getting DP to work in practice requires a
team of experts

• … resembles early days of database research…

Tutorial: Differential Privacy in the Wild 139


… without exception ad hoc,
cumbersome, and difficult to
use – they could really only be
used by people having highly
specialized technical skills …

E. F. Codd on the state of


databases in early 1970s

Tutorial: Differential Privacy in the Wild 140


Challenge 2: From Algorithms to
Systems
• Today, getting DP to work in practice requires a team of
experts

• Example of systems work: Privacy Integrated Queries


(PINQ) [M10]
– Guarantees that programs satisfy privacy…
– … but program author responsible for accuracy

• Need more research on systems…


– Modular components
– Automatic optimization

Tutorial: Differential Privacy in the Wild 141


Challenge 3: Communicating privacy-
utility tradeoffs
• Inherent tradeoff between utility and privacy
  [screenshot of a SIGMOD 2016 paper]
• Must be communicated to stakeholders
• Need for tunable algorithms

Tutorial: Differential Privacy in the Wild 142


Thank you!
Ashwin Machanavajjhala
Assistant Professor, Duke University

“What does privacy mean … mathematically?”

Michael Hay
Assistant Professor, Colgate University

“Can algorithms be provably private and useful?”

Xi He
Ph.D. Candidate, Duke University

“Can privacy algorithms work in real world systems?”


Tutorial: Differential Privacy in the Wild 143
Module 4 References
• [KLNR11] Kasiviswanathan, Lee, Nissim, Raskhodnikova, “What Can We Learn
Privately?”, SIAM Journal of Computing, 2011
• [CMS11] Chaudhuri, Monteleoni, Sarwate, “Differentially Private Empirical Risk
Minimization”, MLRJ 2011
• [CSC13] Song, Chaudhuri, Sarwate, “Stochastic gradient descent with differentially
private updates”, GlobalSIP 2013
• [BST14] Bassily, Smith, Thakurta, “Private Empirical Risk Minimization, Revisited”,
FOCS 2014
• [SS15] Shokri, Shmatikov, “Privacy-Preserving Deep Learning”. CCS 2015
• [JKT12] Jain, Kothari, Thakurta , “Differentially Private Online Learning”, COLT,
2012
• [ACG16] Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, Zhang, “Deep
Learning with Differential Privacy”, SIGSAC 2016
• [WLK17] Wu, Li, Kumar, Chaudhuri, Jha, Naughton, “Bolt-on Differential Privacy for
Scalable Stochastic Gradient Descent-based Analytics”, SIGMOD 2017
• [DRV10] Dwork, Rothblum, Vadhan. Boosting and differential privacy, FOCS 2010

Module 4 Tutorial: Differential Privacy in the Wild 144


Module 5 References
• [DN10] Dwork, Naor “On the Difficulties of Disclosure Protection in
Statistical Databases or the Case for Differential Privacy”, JPC 2010
• [KM11] Kifer, Machanavajjhala, “No Free Lunch in Data Privacy”, SIGMOD
2011
• [KM12] Kifer, Machanavajjhala, “A rigorous and customization framework
for privacy”, PODS 2012
• [KM14] Kifer, Machanavajjhala, “Pufferfish: A framework for mathematical
privacy definitions”, TODS 2014
• [HMD14] He, Machanavajjhala, Ding, “Blowfish Privacy” SIGMOD 2014
• [HMD16] Haney, Machanavajjhala, Ding, “Design of Policy Aware
Differentially Private Algorithms”, VLDB 2016
• [CEBP13] Chatzikokolakis, Andres, Bordenabe, Palamidessi, “Broadening the
scope of differential privacy using metrics”, PoPETS 2013
• [WSC16] Wang, Song, Chaudhuri, “Privacy Preserving analysis of Correlated
Data”, Corr abs/1603.03977, 2016

Module 5 Tutorial: Differential Privacy in the Wild 145


Module 6 References
• [BDK07] Backstrom, Dwork, Kleinberg, “Wherefore Art Thou R3579X? Anonymized Social Networks,
Hidden Patterns, and Structural Steganography”, WWW 2007
• [NS09] Narayanan, Shmatikov, “De-anonymizing Social Networks.” IEEE Security & Privacy 2009
• [KNRS13] Kasiviswanathan, Nissim, Raskhodnikova, Smith, “Analyzing Graphs with Node Differential
Privacy”, TCC 2013
• [NRS07] Nissim, Raskhodnikova, Smith. “Smooth Sensitivity and Sampling in Private Data Analysis”, STOC
2007
• [RS15] Raskhodnikova, Smith, “Efficient Lipschitz Extensions for High-Dimensional Graph Statistics and
Node Private Degree Distributions”, arXiv:1504.07912v1, 2015
• [DLL16] Day, Li, Lyu. “Publishing Graph Degree Distribution with Node Differential Privacy”, SIGMOD
2016
• [KRSY09] Karwa, Raskhodnikova, Smith, Yaroslavtsev, “Private Analysis of Graph Structure”, VLDB 2011
• [DL09] Dwork, Lei. “Differential privacy and robust statistics”, STOC 2009
• [HLMJ09] Hay, Li, Miklau, Jensen, “Accurate estimation of the degree distribution of private networks”,
ICDM 2009.
• [HRMS10] Hay, Rastogi, Miklau, Suciu. “Boosting the accuracy of differentially-private queries through
consistency”, PVLDB 2010.

Module 6 Tutorial: Differential Privacy in the Wild 146


Module 6 References
• [KS12] Karwa, Slavkovic. “Differentially Private Graphical Degree Sequences and Synthetic Graphs”, Privacy
in Statistical Databases 2012: 273-285
• [LK13] Lin, Kifer. “Information preservation in statistical privacy and bayesian estimation of unattributed
histograms”. SIGMOD 2013
• [GRU12] Gupta, Roth, Ullman, “Iterative Constructions and Private Data Release”, TCC 2012
• [ZCPSX15] Zhang, Cormode, Procopiuc, Srivastava and Xiao, “Private Release of Graph Statistics using
Ladder Functions”, SIGMOD 2015
• [KSK15] Karwa, Slavkovi´c, Krivitsky. “Differentially Private Exponential Random Graphs”,
arXiv:1409.4696v2 2015
• [LM14] Lu. Miklau, “Exponential random graph estimation under differential privacy”, KDD 2014
• [KKS15] Karwa, Krivitsky, Slavković, “Sharing Social Network Data: Differentially Private Estimation of
Exponential-Family Random Graph Models”, arXiv:1511.02930v1. 2015
• [BBDS12] Blocki, Blum, Datta, Sheffet, “The Johnson-Lindenstrauss Transform Itself Preserves Differential
Privacy”, FOCS 2012
• [HR10] Hardt, Rothblum, “ A Multiplicative Weights Mechanism for Privacy-Preserving Data
Analysis”, FOCS 2010
• [MHVB13] Montjoye, Hidalgo, Verleysen, Blondel, “ Unique in the crowd: The privacy bounds of human
mobility”. Sci. Rep., 3(1376), 2013.
• [SQBB10] Song, Qu, Blumm, Barabsi. “ Limits of predictability in human mobility”, Science, 2010

Module 6 Tutorial: Differential Privacy in the Wild 147


Module 6 References
• [GKC10] Gambs, Killijian, Cortez, “Show Me How You Move and I Will Tell You Who You Are”, SPRINGL
2010
• [ABCP13] Andrés, Bordenabe, Chatzikokolakis, Palamidessi, “Geo-Indistinguishability: Differential Privacy for
Location-Based Systems”, CCS 2013
• [XX15] Xiao, Xiong. “ Protecting Locations with Differential Privacy under Temporal Correlations”. CCS 2015
• [KPXP14] Kellaris, Papadopoulos, Xiao, Papadias, “Differentially Private Event Sequences over Infinite Streams”,
VLDB 2014
• [HCMPS15] He, Cormode, Machanavajjhala, Procopiuc, Srivastava, “DPT: Differentially private trajectory
synthesis using hierarchical reference systems”, VLDB 2015
• [CAC12] Chen, Acs, Castelluccia, “Differentially private sequential data publication via variable-length n-
grams”, CCS 2012.
• [HMD14] He, Machanavajjhala, Ding, “Blowfish Privacy” SIGMOD 2014
• [HMD16] Haney, Machanavajjhala, Ding, “Design of Policy Aware Differentially Private Algorithms”, VLDB
2016
• [CEBP13] Chatzikokolakis, Andres, Bordenabe, Palamidessi, “Broadening the scope of differential privacy
using metrics”, PoPETS 2013
• [WSC17] Wang, Song, Chaudhuri, “Privacy Preserving analysis of Correlated Data”, SIGMOD, 2017
• [CYXX17] Yang Cao, Masatoshi Yoshikawa, Yonghui Xiao, Li Xiong, “Quantifying Differential Privacy under
Temporal Correlations”, ICDE 2017
• [KM11] Kifer, Machanavajjhala, “No Free Lunch in Data Privacy”, SIGMOD 2011
Module 6 Tutorial: Differential Privacy in the Wild 148
