Big Data

Big Data & Information Privacy
Dr. Waqar Asif

Alvin Lam
We Live in the Age of Big Data
• “Big Data” refers to the acquisition and
analysis of massive collections of information,
collections so large that until recently, the
technology needed to analyze them did not
exist.
What is Big Data
• Walmart handles more than 1 million
customer transactions every hour.
• Facebook handles 40 billion photos from its
user base.
• Decoding the human genome originally took 10
years; now it can be achieved in one week.
Three characteristics of Big Data
1. Volume
– A typical PC might have had 10 gigabytes of
storage in 2000.
– Facebook alone ingests 500 terabytes of new data
every day
– A Boeing 737 generates 240 terabytes of flight data
during a single flight across the US.
2. Velocity
• Clickstreams and ad impressions capture user behavior
at millions of events per second.
• High frequency stock trading algorithms reflect market
changes within microseconds
• Machine to machine processes exchange data between
millions of devices
• Infrastructure and sensors generate massive log data in
real-time
• On-line gaming systems support millions of concurrent
users, each producing multiple inputs per second
3. Variety
• Big data isn’t just numbers, dates and strings.
Big Data is also geospatial data, 3D data, audio
and video, and unstructured text, including log
files and social media.
• Traditional database systems were designed to
address smaller volumes of structured data,
fewer updates or a predictable, consistent
data structure.
How is big data different
1) Automatically generated by a machine (e.g.
Sensor embedded in an engine)
2) Typically an entirely new source of data (e.g.
Use of the internet)
3) Not designed to be friendly (e.g. Text streams)
4) Multiple data generation points:
mobile devices, microphones, scanners/readers,
programs/software, social media, cameras
Big Data Analytics
• Examining large amounts of data
• Identification of hidden patterns, unknown
correlations
• Better business decisions: strategic and
operational
• Effective marketing, customer satisfaction,
increased revenue.
Types of tools used in Big Data
• Where is processing hosted?
– Distributed servers / cloud
• Where is data stored?
– Distributed storage
• What is the programming model?
– Distributed processing
• How is data stored and indexed?
– High-performance schema-free database
• What operations are performed on the data?
– Analytics / semantic processing
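The "distributed processing" programming model can be illustrated with a map/reduce-style word count. This is only a sketch: worker threads stand in for distributed servers, and the function names (`map_chunk`, `reduce_counts`, `word_count`) and the sample log lines are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            total[word] += n
    return dict(total)


def word_count(lines, workers=2):
    # Split the input into one chunk per worker (round-robin).
    chunks = [lines[i::workers] for i in range(workers)]
    # Each worker processes its chunk independently of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


# Hypothetical log lines, used only to exercise the sketch.
LOG_LINES = ["error disk full", "warning disk slow", "error network down"]
RESULT = word_count(LOG_LINES)
```

The point of the split is that the map step needs no shared state, so chunks can live on different machines; only the small per-chunk summaries have to be brought together in the reduce step.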
Risks of Big Data
• Can become overwhelming
– Need the right people to solve the right problem
• Costs escalate too fast
– It isn’t necessary to capture 100% of the data
• Regulations & Violations
– Legal Regulations
– Privacy of users
How Big Data impacts IT
• Big data is a troublesome force presenting
opportunities and challenges to IT
organizations
• By 2015: 4.4 million IT jobs in Big Data, 1.9
million of them in the US alone
• In 2017, data scientist was the No. 1 job in
Harvard’s ranking.
Benefits of Big data
• Real-time big data isn’t just a process for
storing petabytes or exabytes of data in a data
warehouse; it’s about the ability to make
better decisions and take meaningful actions
at the right time.
• Fast forward to the present and technologies
like Hadoop give you the scale and flexibility
to store data before you know how you are
going to process it.
Benefits of Big Data
• Big Data analytics can reveal important patterns that
would otherwise go unnoticed.
• Taking the antidepressant Paxil together with the
anti-cholesterol drug Pravachol could result in diabetic
blood sugar levels. This was discovered by
– (1) deriving a symptomatic footprint characteristic of very
high blood sugar levels from thirty years
of reports in an FDA database, and
– (2) then finding that footprint in Bing searches using an
algorithm that detected statistically significant
correlations. People taking both drugs also tended to enter
search terms (“fatigue” and “headache,” for example) that
constitute the symptomatic footprint.
Loss of Informational Privacy
• Informational privacy is the ability to determine for
ourselves what information about us others collect
and what they do with it.
• None of the developments just outlined can happen
without a loss of control over our data.
We Lose Control, They Gain It
Figure: our data → information aggregators, businesses, government
“We can determine where you work, how you spend your
time, and with whom, and with 87% certainty where you'll be
next Thursday at 5:35 p.m.”
New Privacy Problems?
• Changed privacy problems.
• A particularly complex and difficult tradeoff problem
takes center stage.
• Big Data presents a much wider range of both risks
and benefits—from detecting drug interactions to
reducing emergency room costs to improving police
response times.
Energy (Smart Meters)
Electricity Provider
Benefits
• Constant awareness about your energy bill
• Faster outage detection
• Accurate energy bill
• Unique insight on your consumption
• Efficient energy management for time-based rates
Energy (Smart Meters)
When using such meters we
are not just the “consumers”
but a “product” for the
Electricity supply companies.
Initial Text Publishing Technique
• Data collection contains person-specific information.
• Sharing this collected information is valuable both in research
and business.
• Publishing this data may put a person’s privacy at risk.
Example: Hospital has some person specific patient data.
     Identifier   Non-Sensitive Data             Sensitive Data
#    Name         Zip     Age   Nationality      Condition
1    Kumar        13053   28    Indian           Heart Disease
2    Bob          13057   29    American         Heart Disease
3    Ivan         13063   35    Canadian         Viral Infection
4    Umeko        13067   36    Japanese         Cancer
Example Cont..
Published data
     Identifier   Non-Sensitive Data             Sensitive Data
1    Kumar        13053   28    Indian           Heart Disease

Voter list
#    Name         Zip     Age   Nationality
1    John         13053   28    American
2    Bob          13057   29    American
3    Chris        13053   23    American

Data leak!
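The leak in this example can be sketched as a join between the published hospital rows and the voter list on the shared non-sensitive attributes. The sketch assumes the hospital removed the Name column before publishing; the function name `link` and the dictionary layout are illustrative only.

```python
# Hospital rows as published, assuming the Name column was dropped first.
PUBLISHED = [
    {"zip": "13053", "age": 28, "nationality": "Indian", "condition": "Heart Disease"},
    {"zip": "13057", "age": 29, "nationality": "American", "condition": "Heart Disease"},
    {"zip": "13063", "age": 35, "nationality": "Canadian", "condition": "Viral Infection"},
    {"zip": "13067", "age": 36, "nationality": "Japanese", "condition": "Cancer"},
]

# Publicly available voter list from the slide.
VOTERS = [
    {"name": "John", "zip": "13053", "age": 28, "nationality": "American"},
    {"name": "Bob", "zip": "13057", "age": 29, "nationality": "American"},
    {"name": "Chris", "zip": "13053", "age": 23, "nationality": "American"},
]

# Attributes that appear in both data sets.
SHARED = ("zip", "age", "nationality")


def link(published, voters, shared=SHARED):
    """Return (name, condition) for every voter matching exactly one published row."""
    leaks = []
    for voter in voters:
        key = tuple(voter[attr] for attr in shared)
        hits = [row for row in published
                if tuple(row[attr] for attr in shared) == key]
        if len(hits) == 1:  # a unique match re-identifies the person
            leaks.append((voter["name"], hits[0]["condition"]))
    return leaks


LEAKS = link(PUBLISHED, VOTERS)
```

Here only Bob matches a published row on all three attributes, so his condition is exposed even though no name was published.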
GDPR and Information Privacy
• Officially implemented on 25 May 2018
• Protection of personal data and privacy of EU
citizens
• Right to access
• Right to be forgotten
• Lawful bases for processing personal data
– Consent, Contract, Legitimate Interest, Vital
Interest, Legal Requirement, Public Interest
How do we Ensure Privacy?
• Non-interactive protection
– A protected version of the original data set
is released.
– Example: k-Anonymity, l-diversity, t-closeness.
– Figure: Contributors → Data Collector → Data Analyst
• Interactive protection
– A user-queried data analysis is performed
on the original data set and then a
protected version of the results is returned
to the user.
– Differential Privacy
• Apple’s iOS 10
– Figure: Contributors → Data Collector ⇄ Data Analyst (queries Q1, Q2; answers A1, A2)
K-Anonymity (Sweeney, Latanya. "k-anonymity: A model for protecting privacy."
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10.05 (2002): 557-570.)
• Change data in such a way that for each tuple in the resulting
table there are at least (k-1) other tuples with the same value
for the quasi-identifier.
– Replace the original value by a semantically consistent but less specific
value.
     Identifier   Quasi-Identifier (Non-Sensitive Data)   Sensitive Data
1    Kumar        13053   28   Indian                     Heart Disease
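The definition above can be checked mechanically: group the rows by their quasi-identifier values and verify that every group has at least k members. A minimal sketch follows, using the hospital rows and a generalized version of them that appears later in these slides; the function name `is_k_anonymous` and the row layout are illustrative.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[attr] for attr in quasi_identifiers) for row in rows)
    return all(size >= k for size in groups.values())


QI = ("zip", "age", "nationality")

# Hospital rows before generalization: every combination is unique.
RAW = [
    {"zip": "13053", "age": "28", "nationality": "Indian", "condition": "Heart Disease"},
    {"zip": "13057", "age": "29", "nationality": "American", "condition": "Heart Disease"},
    {"zip": "13063", "age": "35", "nationality": "Canadian", "condition": "Viral Infection"},
    {"zip": "13067", "age": "36", "nationality": "Japanese", "condition": "Cancer"},
]

# The same rows with ZIP, age and nationality replaced by less specific values.
GENERALIZED = [
    {"zip": "1305*", "age": "<40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "1305*", "age": "<40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "1306*", "age": "<40", "nationality": "*", "condition": "Viral Infection"},
    {"zip": "1306*", "age": "<40", "nationality": "*", "condition": "Cancer"},
]

RAW_IS_2_ANONYMOUS = is_k_anonymous(RAW, QI, 2)
GENERALIZED_IS_2_ANONYMOUS = is_k_anonymous(GENERALIZED, QI, 2)
```

The sensitive column plays no part in the check, which is why k-anonymity alone says nothing about how varied the conditions inside a group are.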
K-Anonymity
• Generalization hierarchies (most general value first)

ZIP
*****
  130**
    1305*: 13053, 13057
    1306*: 13063, 13067

Age
*
  <40
    <30: 28, 29
    3*: 36, 35

Nationality
*
  American: Canadian, US
  Asian: Indian, Japanese
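The hierarchies on this slide can be written as small functions that map a value to a chosen level of generality. This is a sketch under assumptions: the level numbering, the function names, and the `>=40` bucket for ages the slide does not cover are all illustrative.

```python
def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a ZIP code with '*'."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level must be between 0 and the ZIP length")
    return zip_code[:len(zip_code) - level] + "*" * level


def generalize_age(age, level):
    """Level 0: exact age, 1: '<30' or decade, 2: '<40', 3: '*'."""
    if level == 0:
        return str(age)
    if level == 1:
        return "<30" if age < 30 else f"{age // 10}*"
    if level == 2:
        # '>=40' is an assumed bucket; the slide only covers ages under 40.
        return "<40" if age < 40 else ">=40"
    return "*"


# Leaf-to-region mapping taken from the nationality hierarchy on the slide.
REGION = {"Canadian": "American", "US": "American",
          "Indian": "Asian", "Japanese": "Asian"}


def generalize_nationality(nationality, level):
    """Level 0: exact value, 1: region, 2: '*'."""
    if level == 0:
        return nationality
    if level == 1:
        return REGION[nationality]
    return "*"
```

For example, `generalize_zip("13053", 1)` gives `"1305*"` and `generalize_age(35, 1)` gives `"3*"`, the values used in the generalized tables on the following slide.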
K-Anonymity
2-minimal generalization
#    Zip     Age   Nationality   Condition
1    1305*   <40   *             Heart Disease
2    1305*   <40   *             Heart Disease
3    1306*   <40   *             Viral Infection
4    1306*   <40   *             Cancer

#    Zip     Age   Nationality   Condition
1    130**   <30   American      Viral Infection
2    130**   <30   Asian         Cancer
3    130**   3*    Asian         Heart Disease
4    130**   3*    American      Heart Disease

The more the generalization, the lesser the utility
#    Zip     Age   Nationality   Condition
1    130**   <40   *             Viral Infection
2    130**   <40   *             Cancer
3    130**   <40   *             Heart Disease
4    130**   <40   *             Heart Disease
K-Anonymity
Original data
     Non-Sensitive                  Sensitive
#    Zip     Age   Nationality      Condition
1    13053   28    Russian          Heart Disease
2    13068   29    American         Heart Disease
3    13068   21    Japanese         Viral Infection
4    13053   23    American         Viral Infection
5    14853   50    Indian           Cancer
6    14853   55    Russian          Heart Disease
7    14850   47    American         Viral Infection
8    14850   49    American         Viral Infection
9    13053   31    American         Cancer
10   13053   37    Indian           Cancer
11   13068   36    Japanese         Cancer
12   13068   35    American         Cancer

4-Anonymous data
     Non-Sensitive                  Sensitive
#    Zip     Age   Nationality      Condition
1    130**   <30   *                Heart Disease
2    130**   <30   *                Heart Disease
3    130**   <30   *                Viral Infection
4    130**   <30   *                Viral Infection
5    148**   >40   *                Cancer
6    148**   >40   *                Heart Disease
7    148**   >40   *                Viral Infection
8    148**   >40   *                Viral Infection
9    130**   3*    *                Cancer
10   130**   3*    *                Cancer
11   130**   3*    *                Cancer
12   130**   3*    *                Cancer

• Bob is 31 years old
• American
• Male
• Lives in zip code 13053
Figure labels on the row groups: Ken (rows 5 to 8), Bob (rows 9 to 12)
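What the facts about Bob reveal can be worked out by matching them against the 4-anonymous table. The sketch below assumes the table is held as (zip, age, condition) tuples; the helper names `zip_matches`, `age_matches` and `possible_conditions` are illustrative.

```python
# The 4-anonymous table from the slide as (zip, age, condition) rows.
ANONYMIZED = (
    [("130**", "<30", c) for c in ("Heart Disease", "Heart Disease",
                                   "Viral Infection", "Viral Infection")]
    + [("148**", ">40", c) for c in ("Cancer", "Heart Disease",
                                     "Viral Infection", "Viral Infection")]
    + [("130**", "3*", "Cancer")] * 4
)


def zip_matches(value, pattern):
    """True if `value` fits `pattern`, where '*' stands for any character."""
    return len(value) == len(pattern) and all(
        p in ("*", c) for c, p in zip(value, pattern))


def age_matches(age, pattern):
    """Handle the three age formats in the table: '<30', '>40' and '3*'."""
    if pattern.startswith("<"):
        return age < int(pattern[1:])
    if pattern.startswith(">"):
        return age > int(pattern[1:])
    return zip_matches(str(age), pattern)


def possible_conditions(zip_code, age, table=ANONYMIZED):
    """Conditions of every row consistent with the known ZIP code and age."""
    return {cond for z, a, cond in table
            if zip_matches(zip_code, z) and age_matches(age, a)}


BOB = possible_conditions("13053", 31)
```

Bob's ZIP code and age place him in the last group of four rows, and every row there lists Cancer, so his condition is disclosed even though the table is 4-anonymous.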
L-Diversity (Machanavajjhala, Ashwin, et al. "l-diversity: Privacy beyond k-anonymity."
ACM Transactions on Knowledge Discovery from Data (TKDD) 1.1 (2007): 3.)
     Non-Sensitive                Sensitive          |       Non-Sensitive               Sensitive
#    Zip     Age   Nationality    Condition          |  #    Zip     Age   Nationality   Condition
1    13053   28    Russian        Heart Disease      |  1    1305*   ≤40   *             Heart Disease
2    13068   29    American       Heart Disease      |  4    1305*   ≤40   *             Viral Infection
3    13068   21    Japanese       Viral Infection    |  9    1305*   ≤40   *             Cancer
4    13053   23    American       Viral Infection    |  10   1305*   ≤40   *             Cancer
5    14853   50    Indian         Cancer             |  5    1485*   >40   *             Cancer
6    14853   55    Russian        Heart Disease      |  6    1485*   >40   *             Heart Disease
7    14850   47    American       Viral Infection    |  7    1485*   >40   *             Viral Infection
8    14850   49    American       Viral Infection    |  8    1485*   >40   *             Viral Infection
9    13053   31    American       Cancer             |  2    1306*   ≤40   *             Heart Disease
10   13053   37    Indian         Cancer             |  3    1306*   ≤40   *             Viral Infection
11   13068   36    Japanese       Cancer             |  11   1306*   ≤40   *             Cancer
12   13068   35    American       Cancer             |  12   1306*   ≤40   *             Cancer
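One simple reading of l-diversity (distinct l-diversity) counts the different sensitive values in each group of rows that share the same quasi-identifier values; the table is l-diverse if the smallest count is at least l. The cited paper defines stricter variants as well, which this sketch does not cover. The function name `distinct_l` and the tuple layout are illustrative.

```python
from collections import defaultdict


def distinct_l(rows):
    """Smallest number of distinct sensitive values in any group of rows."""
    groups = defaultdict(set)
    for zip_code, age, condition in rows:
        groups[(zip_code, age)].add(condition)
    return min(len(values) for values in groups.values())


# The 4-anonymous table from the previous slide.
FOUR_ANONYMOUS = (
    [("130**", "<30", c) for c in ("Heart Disease", "Heart Disease",
                                   "Viral Infection", "Viral Infection")]
    + [("148**", ">40", c) for c in ("Cancer", "Heart Disease",
                                     "Viral Infection", "Viral Infection")]
    + [("130**", "3*", "Cancer")] * 4
)

# The right-hand table on this slide ('<=40' written in ASCII).
REGROUPED = (
    [("1305*", "<=40", c) for c in ("Heart Disease", "Viral Infection",
                                    "Cancer", "Cancer")]
    + [("1485*", ">40", c) for c in ("Cancer", "Heart Disease",
                                     "Viral Infection", "Viral Infection")]
    + [("1306*", "<=40", c) for c in ("Heart Disease", "Viral Infection",
                                      "Cancer", "Cancer")]
)

L_BEFORE = distinct_l(FOUR_ANONYMOUS)
L_AFTER = distinct_l(REGROUPED)
```

Under this reading the 4-anonymous table scores 1, because one group contains only Cancer, while the regrouped table scores 3.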
Issues with L-Diversity
• L-diversity does not prevent probabilistic inference:
– If a sensitive value occurs multiple times within a group, a probabilistic inference is
possible.
• The basic assumption of having multiple sensitive values is unrealistic.
– For instance, the database is the output of a test with only positive and negative as
sensitive values, with 99% of the overall data negative.
• Inference about the salary and the disease is easily made using quasi-identifiers.

     Non-Sensitive               Sensitive
#    Zip     Age   Nationality   Condition
1    1305*   ≤40   *             Heart Disease
4    1305*   ≤40   *             Viral Infection
9    1305*   ≤40   *             Cancer
10   1305*   ≤40   *             Cancer
5    1485*   >40   *             Cancer
6    1485*   >40   *             Heart Disease
7    1485*   >40   *             Viral Infection
8    1485*   >40   *             Viral Infection
2    1306*   ≤40   *             Heart Disease
3    1306*   ≤40   *             Viral Infection
11   1306*   ≤40   *             Cancer
12   1306*   ≤40   *             Cancer

#    Zip     Age   Salary   Disease
1    476**   2*    3K       Gastric ulcer
2    476**   2*    4K       Gastritis
3    476**   2*    5K       Stomach cancer
4    4790*   ≥40   6K       Gastritis
5    4790*   ≥40   11K      Flu
6    4790*   ≥40   8K       Bronchitis
7    476**   3*    7K       Bronchitis
8    476**   3*    9K       Pneumonia
9    476**   3*    10K      Stomach cancer
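The probabilistic inference can be made concrete by comparing how common a condition is in the whole table with how common it is inside one group. The sketch below uses the first table on this slide as (zip, age, condition) tuples; the function names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# First table on this slide ('<=40' written in ASCII).
TABLE = (
    [("1305*", "<=40", c) for c in ("Heart Disease", "Viral Infection",
                                    "Cancer", "Cancer")]
    + [("1485*", ">40", c) for c in ("Cancer", "Heart Disease",
                                     "Viral Infection", "Viral Infection")]
    + [("1306*", "<=40", c) for c in ("Heart Disease", "Viral Infection",
                                      "Cancer", "Cancer")]
)


def condition_probabilities(rows):
    """Fraction of rows carrying each condition."""
    counts = Counter(condition for _, _, condition in rows)
    total = sum(counts.values())
    return {cond: Fraction(n, total) for cond, n in counts.items()}


def per_group_probabilities(rows):
    """Condition fractions inside each (zip, age) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[:2]].append(row)
    return {key: condition_probabilities(members)
            for key, members in groups.items()}


OVERALL = condition_probabilities(TABLE)
BY_GROUP = per_group_probabilities(TABLE)
```

Across the table 5 of 12 rows list Cancer, but inside the group (1305*, ≤40) it is 2 of 4, so knowing someone's group raises the estimated chance of Cancer from about 42% to 50% even though every group holds three distinct conditions.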
Differential Privacy (C. Dwork, F. McSherry, K. Nissim, and A. Smith.
Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography
Conference, pages 265–84, 2006.)
Figure: Contributors → Data Collector ⇄ Data Analyst (queries Q1, Q2; answers A1, A2)
• Bob is in the dataset
• Objective: You can’t learn anything new about Bob that he has not
provided.
• Callout: “All schools in the UK teach in different languages”
• Is this a privacy compromise? NO…
Differential Privacy
• The outcome of any analysis is essentially equally likely,
independent of whether any individual joins, or refrains from
joining, the dataset.
– Bob goes away, Ken joins, Bob is replaced by Ken.
• But this does not mean that statistical analysis cannot affect
the individual.
Q1: Smoking vs Cancer
A1: Smoking causes cancer with a probability X
Figure: Data Collector, Data Analyst, Insurance Premium
Differential Privacy
• M gives ε-differential privacy if, for all pairs of data sets x, y
differing in the data of one person, and all events S:
\[ \Pr[M(x) \in S] \le e^{\epsilon} \, \Pr[M(y) \in S] \]
Three copies of the data set are shown side by side; each lists the following rows:

     Non-Sensitive                 Sensitive
#    Zip     Age   Nationality     Condition
1    13053   28    Russian         Heart Disease
2    13068   29    American        Heart Disease
3    13068   21    Japanese        Viral Infection
4    13053   23    American        Viral Infection
5    14853   50    Indian          Cancer
6    14853   55    Russian         Heart Disease
9    13053   31    American        Cancer
10   13053   37    Indian          Cancer
11   13068   36    Japanese        Cancer
12   13068   35    American        Cancer

So we simply forward a random value between 33.3% and 45.4% as the output to the query.
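A common way to meet the ε-differential privacy definition for a counting query is the Laplace mechanism from the paper cited above: add noise drawn from a Laplace distribution with scale 1/ε to the true count. This is a different technique from the range-based answer on the slide, shown here only as a sketch; the function names and the choice ε = 1 are assumptions.

```python
import random

# Condition column of the ten rows listed on the slide.
CONDITIONS = (["Heart Disease"] * 2 + ["Viral Infection"] * 2
              + ["Cancer", "Heart Disease"] + ["Cancer"] * 4)


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, target, epsilon, rng):
    """Noisy answer to the query 'how many values equal target?'.

    Adding or removing one person changes the true count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if v == target)
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
ANSWERS = [private_count(CONDITIONS, "Cancer", 1.0, rng) for _ in range(20000)]
MEAN_ANSWER = sum(ANSWERS) / len(ANSWERS)
```

Each individual answer is perturbed, so no single response pins down whether a given person is in the data, while the average over many answers stays close to the true count of 5. A smaller ε gives more noise and stronger privacy.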
References
• Cloud Security & Privacy - An Enterprise Perspective on Risks and Compliance
• Securing the Cloud: Cloud Computer Security Techniques and Tactics
• Further resources, articles, case studies made available on Bb