
Big Data Processing

Jiaul Paik
Lecture 2
Introduction to Big Data

Acknowledgements: Some of the slides are taken from Jimmy Lin, University of Waterloo
Ubiquitous Questions!!!

What is big data?

Why big data?

How to deal with big data?


How much data?

Processes 20 PB a day (2008)

Crawls 20B web pages a day (2012)

Search index is 100+ PB (5/2014)

Bigtable serves 2000 petabytes, 600M QPS (5/2014)


How much data?

Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)

S3: 2T objects, 1.1M requests/second (4/2013)
How much data?

300 PB data in Hive + 600 TB/day (4/2014)

150 PB on 50k+ servers running 15k apps (6/2011)

LHC: ~15 PB a year


Why big data?
Science
Engineering
Commerce
Society

Source: Wikipedia (Everest)


Science

Data-intensive Science
Large Hadron Collider: particle collider
~15 petabytes/year

Maximilien Brice, © CERN


Engineering
The unreasonable effectiveness of data
Search, recommendation, prediction, …

Source: Wikipedia (Three Gorges Dam)


Language Translation
Focus of this course

The “big data stack” (top to bottom):

Data Science Tools
Analytics Infrastructure
Execution Infrastructure

This course covers the analytics and execution infrastructure layers.


Buzzwords

Data Science Tools: data analytics, business intelligence, OLAP, ETL, data warehouses and data lakes

This Course (Analytics Infrastructure):
Text: frequency estimation, language models, inverted indexes
Graphs: graph traversals, random walks (PageRank)
Relational data: SQL, joins, column stores
Data mining: hashing, clustering (k-means), classification, recommendations
Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)

Execution Infrastructure: MapReduce, Spark, noSQL, Flink, Pig, Hive, Dryad, Pregel, Giraph, Storm

“big data stack”

This course focuses on algorithm design and “programming at scale”


What is the Goal of Big Data Processing?
• Finding useful patterns/insights/models from large data in a
reasonable amount of time

• The primary focus is on efficiency, without losing much accuracy

• Scalability of an algorithm

• How its complexity grows with the problem size

• How well it can handle big data


Two Common Routes to Scalability

1. Improving Algorithmic Efficiency

2. Parallel Processing

• Scale-up architecture: a powerful server with lots of RAM, disk and CPU cores

• Scale-out architecture: a cluster of low-cost computers


• Hadoop, MapReduce, Spark
An Example: Data Clustering

Create 10000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers (that is why it is called k-means)

STEP 2: Assign/cluster each member to the closest center

STEP 3: Recalculate the centers

(Steps 2 and 3 are repeated iteratively.)
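A minimal sketch of these three steps in Python (assuming the data is a NumPy array; the function name and defaults are illustrative, not from the slides). The brute-force distance computation here only fits small data in memory, which is exactly the problem the following slides address.

import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means, following the three steps on the slide."""
    rng = np.random.default_rng(seed)
    # STEP 1: start with k initial cluster centers (random points from the data)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # STEP 2: assign each member to the closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 3: recalculate each center as the mean of its members
        centers = np.stack([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return centers, labels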
K-means: Illustration
K-means: Initialize centers randomly
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: The Expensive Part

Create 10000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers (that is why it is called k-means)

STEP 2: Assign/cluster each member to the closest center
Cost per iteration: 10^9 × 10^3 × 10^4 = 10^16 distance operations

STEP 3: Recalculate the centers

(Steps 2 and 3 are repeated iteratively.)
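A quick sanity check on that figure (my own arithmetic, not from the slides): each iteration compares every vector with every center across every dimension.

n_points = 10**9   # 1 billion vectors
dim      = 10**3   # dimension 1000
k        = 10**4   # 10000 clusters

ops_per_iteration = n_points * dim * k
print(f"{ops_per_iteration:.0e}")   # ~1e+16 operations per iteration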
Solution 1: Improving Algorithmic Efficiency
Sampling-based k-means

• Take a random sample of the data

• Apply k-means on that sample to produce approximate centroids

• Key assumption:
• The centroids computed from the random sample are very close to the centroids of the original data

Selective Search: Efficient and Effective Search of Large Textual Collections,
by Kulkarni and Callan, ACM TOIS, 2016
Illustration: Sampling-based k-means

Original data (original centroids)
→ take a random sample (30%)
→ run k-means on the sample
→ approximate centroids
→ assign the original data to clusters using the approximate centroids
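A hedged sketch of this sampling route, reusing the kmeans sketch from earlier (the 30% rate mirrors the illustration; all names are illustrative):

import numpy as np

def sampled_kmeans(points, k, sample_rate=0.30, seed=0):
    """Sampling-based k-means: cluster a random sample, then assign everyone."""
    rng = np.random.default_rng(seed)
    # Take a random sample of the data (30% here, as in the illustration)
    idx = rng.choice(len(points), size=int(sample_rate * len(points)), replace=False)
    # Run ordinary k-means on the sample to get approximate centroids
    approx_centers, _ = kmeans(points[idx], k)
    # Assign every original point to its nearest approximate centroid
    dists = np.linalg.norm(points[:, None, :] - approx_centers[None, :, :], axis=2)
    return approx_centers, dists.argmin(axis=1)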
Solution 2: Parallel Processing

STEP 1: Start with k initial cluster centers

STEP 2: Assign/cluster each member to the closest center

STEP 3: Recalculate the centers

(Steps 2 and 3 are repeated iteratively.)

Parallelization idea (see the sketch below):
1. Split the data into small chunks
2. Process each chunk on different cores / nodes in a cluster
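A minimal single-machine sketch of this idea, splitting the expensive assignment step across cores with Python's standard library (a stand-in for what MapReduce/Spark do across nodes; all names are illustrative):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def assign_chunk(args):
    """STEP 2 on one chunk: assign each point in the chunk to the nearest center."""
    chunk, centers = args
    dists = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def parallel_assign(points, centers, n_workers=4):
    """Split the data into chunks and process each chunk on a different core."""
    chunks = np.array_split(points, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(assign_chunk, [(c, centers) for c in chunks])
    return np.concatenate(list(parts))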
Pros and Cons
• Sampling-based method
• Pros: Fast processing
• Cons: Lossy inference, often low accuracy

• Scale-up architecture
• Pros: Fast processing (if the data fits into RAM)
• Cons: Risk of data loss (system failure), scalability issues for very large data

• Scale-out architecture
• Pros: Can handle very large data, fault tolerant
• Cons: Communication bottleneck, difficulty in writing code
How to Tackle Big Data?

Source: Google
Divide and Conquer: the good, old and reliable friend

“Work” → Partition → w1, w2, w3
Each worker processes its piece: w1 → r1, w2 → r2, w3 → r3
Combine r1, r2, r3 → “Result”
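The same pattern as a generic skeleton (a sketch only; the partition, worker and combine functions are placeholders supplied by the caller):

from concurrent.futures import ProcessPoolExecutor

def divide_and_conquer(work, partition, worker, combine, n_workers=3):
    """Partition the work into w1..wn, run a worker on each piece to get
    r1..rn, then combine the partial results into the final result."""
    pieces = partition(work, n_workers)            # "Work" -> w1, w2, w3
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(worker, pieces))   # workers -> r1, r2, r3
    return combine(results)                        # Combine -> "Result"

# Example: summing a list of numbers piece by piece
if __name__ == "__main__":
    data = list(range(1_000_000))
    total = divide_and_conquer(
        data,
        partition=lambda w, n: [w[i::n] for i in range(n)],
        worker=sum,
        combine=sum,
    )
    print(total)   # 499999500000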
What are the Challenges?
• How do we assign work units to workers?

• What if we have more work units than workers?

• What if workers need to share partial results?

• How do we aggregate partial results?

• How do we know all the workers have finished?

• What if workers die?


A Critical Issue

• Parallelization problems arise from:


• Communication between workers (e.g., to exchange state)
• Access to shared resources (e.g., data)

• Thus, we need a synchronization mechanism
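A toy illustration of why (a sketch with Python threads; without the lock, the concurrent read-modify-write on the shared counter can race):

from threading import Thread, Lock

counter = 0          # shared state
lock = Lock()        # the synchronization mechanism

def worker(n):
    global counter
    for _ in range(n):
        with lock:   # serialize access to the shared resource
            counter += 1

threads = [Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)       # 400000 with the lock; possibly less without it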


Bad synchronization!!

Source: Ricardo Guimarães Herrmann
