BDP 2023 02
Jiaul Paik
Lecture 2
Introduction to Big Data
Acknowledgements: Some of the slides are taken from Jimmy Lin, University of Waterloo
Ubiquitous Questions!!!
Data-intensive Science
Large Hadron Collider (particle collider): 15 petabytes/year
This Course: the “big data stack”
• Layers: Data Science Tools, Analytics Infrastructure, Execution Infrastructure
• Systems: MapReduce, Spark, NoSQL, Flink, Pig, Hive, Dryad, Pregel, Giraph, Storm
• Relational data: SQL, joins, column stores
• Data mining: hashing, clustering (k-means), classification, recommendations
• Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)
• Scalability of an algorithm
2. Parallel Processing
• Scale-up architecture: a powerful server with lots of RAM, disk, and CPU cores
• Key assumption:
• The centroids from the random samples are very close to the centroids of the
original data
Figure: random sample (30%) → k-means → approximate centroids, shown against the original centroids
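The sampling assumption above can be checked on a toy dataset. The following is a minimal sketch, not the lecture's own code: the synthetic Gaussian blobs, the farthest-first initialisation, and the helper names `dist2` and `kmeans` are all assumptions made for illustration. It runs k-means once on the full data and once on a 30% random sample, then measures how far apart the two sets of centroids are.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns the centroids sorted for easy comparison."""
    # Assumption: deterministic farthest-first initialisation, chosen here
    # only so the toy example does not depend on a lucky random start.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)

# Hypothetical data: three well-separated 2-D Gaussian blobs, 1000 points each.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
data = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in true_centers
    for _ in range(1000)
]

# 30% random sample, as in the slide.
sample = rng.sample(data, int(0.3 * len(data)))

full_centroids = kmeans(data, 3)
approx_centroids = kmeans(sample, 3)

# Largest distance between a full-data centroid and its sampled counterpart.
max_gap = max(dist2(a, b) ** 0.5 for a, b in zip(full_centroids, approx_centroids))
print("largest centroid gap:", max_gap)
```

On clean, well-separated data like this the gap is small relative to the spacing of the clusters. The assumption is less safe when clusters are tiny or heavily skewed, since a random sample may miss them.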
Solution 2: Parallel Processing
• Scale-up Architecture
• Pros: Fast processing (if the data fits into RAM)
• Cons: Risk of data loss on system failure; does not scale to very large data
• Scale-out architecture
• Pros: Can handle very large data, fault tolerant
• Cons: Communication bottleneck between machines; code is harder to write
How to Tackle Big Data?
Source: Google
Divide and Conquer: the good old reliable friend
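Figure: “Work” is partitioned into units w1, w2, w3; each unit is processed into a partial result r1, r2, r3; the partial results are combined into the “Result”.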
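The partition, process, and combine pattern can be sketched in a few lines. This is an illustrative sketch only: the task (a sum of squares), the function names `partition`, `process`, and `combine`, and the use of a thread pool on a single machine are assumptions, standing in for the cluster of workers that the lecture has in mind.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(work, n_parts):
    """Split the work into at most n_parts contiguous units (w1, w2, ...)."""
    size = max(1, -(-len(work) // n_parts))  # ceiling division
    return [work[i:i + size] for i in range(0, len(work), size)]

def process(unit):
    """Worker: turn one work unit into a partial result (r1, r2, ...)."""
    return sum(x * x for x in unit)

def combine(partials):
    """Merge the partial results into the final result."""
    return sum(partials)

work = list(range(1, 11))       # the "Work": the numbers 1..10
units = partition(work, 3)      # w1, w2, w3

# Each unit is handed to a worker; map() returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process, units))   # r1, r2, r3

result = combine(partials)      # the "Result"
print(units, partials, result)
```

This works because a sum of squares can be computed independently on each unit and the partial sums can be added in any order. Tasks without that property need a more careful combine step.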
What are the Challenges?
• How do we assign work units to workers?