Big Data
We Live in the Age of Big Data
• “Big Data” refers to the acquisition and
analysis of massive collections of information,
collections so large that until recently, the
technology needed to analyze them did not
exist.
What is Big Data
• Walmart handles more than 1 million
customer transactions every hour.
• Facebook handles 40 billion photos from its
user base.
• Decoding the human genome originally took 10
years; now it can be achieved in
one week.
Three characteristics of Big Data
1. Volume
– A typical PC might have had 10 gigabytes of
storage in 2000.
– Facebook alone ingests 500 terabytes of new data
every day
– A Boeing 737 generates 240 terabytes of flight data
during a single flight across the US.
2. Velocity
• Clickstreams and ad impressions capture user behavior
at millions of events per second.
• High frequency stock trading algorithms reflect market
changes within microseconds
• Machine to machine processes exchange data between
millions of devices
• Infrastructure and sensors generate massive log data in
real-time
• On-line gaming systems support millions of concurrent
users, each producing multiple inputs per second
3. Variety
• Big data isn’t just numbers, dates and strings.
Big Data is also geospatial data, 3D data, audio
and video, and unstructured text, including log
files and social media.
• Traditional database systems were designed to
address smaller volumes of structured data,
with fewer updates and a predictable, consistent
data structure.
How is Big Data different?
1) Automatically generated by a machine (e.g.
a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g.
use of the internet)
3) Not designed to be user-friendly (e.g. text streams)
4) Multiple data generation points:
mobile devices, microphones, scanners/readers,
programs/software, social media, cameras
Big Data Analytics
• Examining large amounts of data
• Identification of hidden patterns and unknown
correlations
• Better business decisions: strategic and
operational
• Effective marketing, customer satisfaction,
increased revenue.
Types of tools used in Big Data
• Where is processing hosted?
– Distributed servers / cloud
• Where is data stored?
– Distributed storage
• What is the programming model?
– Distributed processing
• How is data stored and indexed?
– High-performance schema-free database
• What operations are performed on data?
– Analytics / semantic processing
Risks of Big Data
• Can become overwhelming
– Need the right people to solve the right problem
• Costs escalate too fast
– It isn’t necessary to capture 100% of the data
• Regulations & Violations
– Legal Regulations
– Privacy of users
How Big Data impacts IT
• Big Data is a disruptive force presenting
both opportunities and challenges to IT
organizations.
• By 2015, 4.4 million IT jobs in Big Data; 1.9
million of them in the US alone.
• In 2017, data scientist was the No. 1 job in
Harvard’s ranking.
Benefits of Big data
• Real-time Big Data isn’t just a process for
storing petabytes or exabytes of data in a data
warehouse. It’s about the ability to make
better decisions and take meaningful actions
at the right time.
• Fast forward to the present and technologies
like Hadoop give you the scale and flexibility
to store data before you know how you are
going to process it.
• Big Data analytics can reveal important patterns that
would otherwise go unnoticed.
• Example: taking the antidepressant Paxil together with the
anti-cholesterol drug Pravachol could result in diabetic
blood sugar levels. This was discovered by
– (1) deriving a symptomatic footprint characteristic of very
high blood sugar levels from thirty years
of reports in an FDA database, and
– (2) then finding that footprint in Bing searches using an
algorithm that detected statistically significant
correlations. People taking both drugs also tended to enter
search terms (“fatigue” and “headache,” for example) that
constitute the symptomatic footprint.
Loss of Informational Privacy
• Informational privacy is the ability to determine for
ourselves what information about us others collect
and what they do with it.
• None of the developments just outlined can happen
without a loss of control over our data.
We Lose Control, They Gain It
• Our data flows to information aggregators, businesses, and government.
• “We can determine where you work, how you spend your
time, and with whom, and with 87% certainty where you'll be
next Thursday at 5:35 p.m.”
New Privacy Problems?
• Changed privacy problems.
• A particularly complex and difficult tradeoff problem
takes center stage.
• Big Data presents a much wider range of both risks
and benefits, from detecting drug interactions to
reducing emergency room costs to improving police
response times.
Energy (Smart Meters)
Electricity Provider
Benefits:
• Constant awareness about your energy bill
• Faster outage detection
• Accurate energy bill
• Unique insight on your consumption
• Efficient energy management for time-based rates
Energy (Smart Meters)
• When using such meters, we
are not just the “consumers”
but also a “product” for the
electricity supply companies.
Initial Text Publishing Technique
• Data collection contains person-specific information.
• Sharing this collected information is valuable both in research
and business.
• Publishing this data may put a person’s privacy at risk.
Example: A hospital holds person-specific patient data.
Data Leak!
How do we Ensure Privacy?
• Non-interactive protection
– A protected version of the original data set
is released.
– Examples: k-anonymity, l-diversity, t-closeness.
– Data flow: Contributors -> Data Collector -> Data Analyst
• Interactive protection
– A user-submitted query is run
on the original data set, and then a
protected version of the result is returned
to the user.
– Example: differential privacy
• Apple’s iOS 10
– Data flow: Contributors -> Data Collector <-> Data Analyst
(queries Q1, Q2; answers A1, A2)
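As an illustration of interactive protection, the sketch below answers a counting query on the original records and returns only a noised result, using the Laplace mechanism commonly used to achieve differential privacy. The record layout, the function names (laplace_noise, private_count), and the epsilon value are assumptions made for illustration; none of them come from the slides, and this is not a description of how Apple's iOS 10 works.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Answer 'how many records satisfy predicate?' with Laplace noise.

    A counting query changes by at most 1 when one person's record is
    added or removed (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records held by the data collector (illustrative only).
records = [
    {"age": 28, "condition": "Heart Disease"},
    {"age": 29, "condition": "Heart Disease"},
    {"age": 35, "condition": "Viral Infection"},
    {"age": 36, "condition": "Cancer"},
]

# The analyst sees only the noised answer, never the raw records.
answer = private_count(
    records,
    lambda r: r["condition"] == "Heart Disease",
    epsilon=0.5,
    rng=random.Random(0),
)
print(answer)
```

A smaller epsilon adds more noise (stronger privacy, less accurate answers); a larger epsilon does the opposite.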
K-Anonymity (Sweeney, Latanya. "k-anonymity: A model for protecting privacy."
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10.05 (2002): 557-570.)
• Change data in such a way that for each tuple in the resulting
table there are at least (k-1) other tuples with the same value
for the quasi-identifier.
– Replace the original value by a semantically consistent but less specific
value.
Column roles: Name is the identifier; Zip, Age, and Nationality form the quasi-identifier (non-sensitive data); Condition is the sensitive data.
#  Name   Zip    Age  Nationality  Condition
1  Kumar  13053  28   Indian       Heart Disease
2  Bob    13057  29   American     Heart Disease
3  Ivan   13063  35   Canadian     Viral Infection
4  Umeko  13067  36   Japanese     Cancer
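The definition above can be checked mechanically. The following is a minimal sketch, assuming the rows of the table above with the Name column already dropped; the function name k_of and the dictionary layout are illustrative choices, not part of the slides.

```python
from collections import Counter

def k_of(rows, quasi_identifier):
    """Return the largest k for which the rows are k-anonymous, i.e. the
    size of the smallest group of rows sharing one quasi-identifier value."""
    groups = Counter(tuple(row[col] for col in quasi_identifier) for row in rows)
    return min(groups.values())

# Rows from the table above, identifier (Name) removed.
rows = [
    {"zip": "13053", "age": 28, "nationality": "Indian", "condition": "Heart Disease"},
    {"zip": "13057", "age": 29, "nationality": "American", "condition": "Heart Disease"},
    {"zip": "13063", "age": 35, "nationality": "Canadian", "condition": "Viral Infection"},
    {"zip": "13067", "age": 36, "nationality": "Japanese", "condition": "Cancer"},
]

quasi_identifier = ("zip", "age", "nationality")
k = k_of(rows, quasi_identifier)
print(k)  # 1: every row has a unique quasi-identifier value
```

A result of 1 means each person is uniquely identifiable from Zip, Age, and Nationality alone, so the values must be made less specific before release.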
K-Anonymity
• Generalization
Generalization hierarchies (most general value first):
ZIP: ***** -> 130** -> 1305*, 1306* -> 13053, 13057, 13063, 13067
Age: * -> <40 -> <30, 3*
Nationality: * -> American, Asian

Original table:
#   Zip    Age  Nationality  Condition
1   13053  28   Russian      Heart Disease
2   13068  29   American     Heart Disease
3   13068  21   Japanese     Viral Infection
4   13053  23   American     Viral Infection
5   14853  50   Indian       Cancer
6   14853  55   Russian      Heart Disease
7   14850  47   American     Viral Infection
8   14850  49   American     Viral Infection
9   13053  31   American     Cancer
10  13053  37   Indian       Cancer
11  13068  36   Japanese     Cancer
12  13068  35   American     Cancer

Generalized (4-anonymous) table:
#   Zip    Age  Nationality  Condition
1   130**  <30  *            Heart Disease
2   130**  <30  *            Heart Disease
3   130**  <30  *            Viral Infection
4   130**  <30  *            Viral Infection
5   148**  >40  *            Cancer
6   148**  >40  *            Heart Disease
7   148**  >40  *            Viral Infection
8   148**  >40  *            Viral Infection
9   130**  3*   *            Cancer
10  130**  3*   *            Cancer
11  130**  3*   *            Cancer
12  130**  3*   *            Cancer

Slide annotations: “Lives in zip code 13053”, “Ken”, “Bob”.
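The generalization step can be sketched in a few lines. The example below assumes the 12 records shown on this slide and applies the slide's buckets (ZIP truncated to three digits, age mapped to <30, 3*, or >40, nationality suppressed); the function name generalize and the tuple layout are illustrative assumptions.

```python
from collections import Counter, defaultdict

# (zip, age, nationality, condition) for the 12 original records.
records = [
    ("13053", 28, "Russian", "Heart Disease"),
    ("13068", 29, "American", "Heart Disease"),
    ("13068", 21, "Japanese", "Viral Infection"),
    ("13053", 23, "American", "Viral Infection"),
    ("14853", 50, "Indian", "Cancer"),
    ("14853", 55, "Russian", "Heart Disease"),
    ("14850", 47, "American", "Viral Infection"),
    ("14850", 49, "American", "Viral Infection"),
    ("13053", 31, "American", "Cancer"),
    ("13053", 37, "Indian", "Cancer"),
    ("13068", 36, "Japanese", "Cancer"),
    ("13068", 35, "American", "Cancer"),
]

def generalize(record):
    """Replace quasi-identifier values with less specific ones."""
    zip_code, age, _nationality, condition = record
    zip_gen = zip_code[:3] + "**"  # e.g. 13053 -> 130**
    if age < 30:
        age_gen = "<30"
    elif age < 40:
        age_gen = "3*"
    else:
        # Assumption: an age of exactly 40 is also placed in this bucket.
        age_gen = ">40"
    return (zip_gen, age_gen, "*", condition)  # nationality suppressed

released = [generalize(r) for r in records]

# Size of each group sharing one generalized quasi-identifier value.
group_sizes = Counter(r[:3] for r in released)
k = min(group_sizes.values())
print(k)  # 4

# Distinct sensitive values within each group.
conditions = defaultdict(set)
for r in released:
    conditions[r[:3]].add(r[3])
distinct = {group: len(values) for group, values in conditions.items()}
print(distinct)
```

Every group has four records, so the released table is 4-anonymous. The group (130**, 3*, *) contains only one distinct condition, so anyone known to be in that group is revealed to have cancer; this is the kind of weakness l-diversity is meant to address.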
L-Diversity (Machanavajjhala, Ashwin, et al. "l-diversity: Privacy beyond k-anonymity."
ACM Transactions on Knowledge Discovery from Data (TKDD) 1.1 (2007): 3.)
A1: Smoking causes cancer with a
probability X