01 Introduction
Source: Facebook
DATA3404 ”Scalable Data Management" - 2023 (Roehm) 3
Some Facebook Statistics
In March 2022, Facebook reported
1.96 billion daily active users worldwide.
Supported by a few thousand employees
Infrastructure:
– data centers with n x 10,000 servers
– several specialised data stores
– sharded MySQL (still?) database for the actual user database
http://en.wikipedia.org/wiki/Facebook/
http://www.socialbakers.com/facebook-statistics/
http://gigaom.com/cloud/facebook-shares-some-secrets-on-making-mysql-scale/
Usage Scenario Facebook (ICDE 2010)
– Questions:
– How to efficiently manage large amounts of data?
– How to efficiently find data in those collections?
– How to efficiently serve thousands of concurrent users?
– Declarative Interface
– Specify “what rather than how.”
– Separate “interface from implementation.”
– Scale-agnostic Design
– Local processing without global state that can be easily parallelized or
cloned/restarted on new nodes
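The "what rather than how" idea can be sketched with Python's built-in sqlite3 module. The schema (students, enrolled, sid, name, cid) is an assumption borrowed from the query example used elsewhere in these slides, and the sample rows and the course code OTHER0000 are invented for illustration. The declarative version only states the desired result; the imperative version hard-codes one particular evaluation strategy.

```python
import sqlite3

# Hypothetical sample data; schema and rows are assumptions for illustration.
STUDENTS = [(1, "Ada"), (2, "Bob"), (3, "Cho")]
ENROLLED = [(1, "DATA3404"), (2, "OTHER0000"), (3, "DATA3404")]

def make_db():
    """Build a small in-memory database with the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (sid INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT)")
    conn.executemany("INSERT INTO students VALUES (?, ?)", STUDENTS)
    conn.executemany("INSERT INTO enrolled VALUES (?, ?)", ENROLLED)
    return conn

def declarative(conn):
    """State WHAT is wanted; the DBMS decides how to evaluate it."""
    sql = """
        SELECT s.sid, s.name
        FROM students s JOIN enrolled e ON s.sid = e.sid
        WHERE e.cid = 'DATA3404'
        ORDER BY s.sid
    """
    return conn.execute(sql).fetchall()

def imperative(students, enrolled):
    """Spell out HOW: a hand-written nested-loop join with a filter."""
    result = []
    for sid, name in students:
        for e_sid, cid in enrolled:
            if sid == e_sid and cid == "DATA3404":
                result.append((sid, name))
    return sorted(result)
```

Both functions return the same rows, but only the imperative one would have to be rewritten if, say, an index on enrolled.cid became available.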
What is a Data Management System?
– A Database is a collection of data central to some organisation or
enterprise
– Essential to operation of enterprise
– State of database mirrors state of enterprise
– An important asset on its own
– DATA3404:
– Which physical design choices do we have available?
– What are the advantages / disadvantages of each structure?
– Query Processing:
– Translation into internal representation
– Query optimization
– Query execution
[Figure: alternative query plan trees over students and enrolled, combining join, select cid='DATA3404', and project sid,name]
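To make the optimization step concrete, the sketch below evaluates two logically equivalent plans for the query on this slide (join students with enrolled, select cid='DATA3404', project sid,name) over small in-memory lists. The operator functions, the comparison counter, and the sample rows are assumptions for illustration, not the course's implementation. The only point is that applying the selection before the join shrinks the join input.

```python
# Hypothetical sample relations as lists of dicts.
students = [{"sid": i, "name": f"s{i}"} for i in range(1, 6)]
enrolled = [
    {"sid": 1, "cid": "DATA3404"},
    {"sid": 2, "cid": "OTHER"},
    {"sid": 3, "cid": "DATA3404"},
    {"sid": 4, "cid": "OTHER"},
]

def select(rows, pred):
    """Keep only the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]

def project(rows, cols):
    """Keep only the named columns of each row."""
    return [{c: r[c] for c in cols} for r in rows]

def join(left, right, key, stats):
    """Nested-loop equi-join; records how many row pairs it compares."""
    out = []
    for l in left:
        for r in right:
            stats["comparisons"] += 1
            if l[key] == r[key]:
                out.append({**l, **r})
    return out

def is_data3404(row):
    return row["cid"] == "DATA3404"

def plan_join_first(stats):
    """Plan A: join everything, then filter, then project."""
    joined = join(students, enrolled, "sid", stats)
    return project(select(joined, is_data3404), ["sid", "name"])

def plan_select_first(stats):
    """Plan B: filter enrolled first, then join the smaller input."""
    filtered = select(enrolled, is_data3404)
    joined = join(students, filtered, "sid", stats)
    return project(joined, ["sid", "name"])

s1, s2 = {"comparisons": 0}, {"comparisons": 0}
r1, r2 = plan_join_first(s1), plan_select_first(s2)
print(r1 == r2, s1["comparisons"], s2["comparisons"])  # True 20 10
```

An optimizer's job is to pick the cheaper of such equivalent plans automatically, using cost estimates rather than actually running both.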
Challenge: Multi-Core CPUs
– For example, a recent 2U rack server:
– up to 2 x 18-core Intel® Xeon® CPUs
– up to 1.5 TB RAM
– up to 16 TB SSDs / HDDs
– 4 x Gigabit Ethernet
– optional 2 x 10 Gigabit Ethernet
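One way to use many cores inside a single server is to split a scan into independent chunks and combine the partial results. The sketch below is an illustration under assumptions: the function names and the chunking scheme are hypothetical, and it uses a thread pool so that it runs anywhere. In CPython, a CPU-bound scan would need concurrent.futures.ProcessPoolExecutor instead to actually occupy several cores.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def chunks(data, n):
    """Split data into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // n))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(chunk):
    """Work done independently by one worker: sum the even values."""
    return sum(x for x in chunk if x % 2 == 0)

def parallel_even_sum(data, workers=None):
    """Scan the chunks in parallel and merge the partial sums."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks(data, workers)))
```

Because each chunk is processed without shared state, the number of workers can follow the number of cores without changing the result.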
[source: Jim Gray, HPTS99]
The Alternative: Scale-Out
A single server has limits…
For real Big Data processing, we need to
scale out to a cluster of multiple servers (nodes):
– Multiple datacenters
– At scale multiple datacenters can be used
• Close to customer
• Cross data center data redundancy
• Address international markets efficiently
[Figure: layers labelled Application, Storage, Infrastructure]
[slide by Ion Stoica, UCB, 2013]
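A common way to spread data across the nodes of a scale-out cluster is hash partitioning (sharding), with each record also copied to a further node for redundancy. The sketch below is a minimal illustration: the node names, the ring-order replica rule, and the choice of CRC32 as the hash function are all assumptions, not a description of any particular system.

```python
import zlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

def shard_of(key, num_nodes):
    """Map a key to a node index using a stable (process-independent) hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def placement(key, nodes=NODES, replicas=2):
    """Return the primary node followed by the next nodes in ring order."""
    first = shard_of(key, len(nodes))
    count = min(replicas, len(nodes))
    return [nodes[(first + i) % len(nodes)] for i in range(count)]
```

A call such as placement("user:42") always returns the same two distinct nodes, so any server can route a request without consulting global state. Note that plain modulo hashing moves most keys when the number of nodes changes; consistent hashing is the usual remedy.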
Internal Structure of a DBMS
– A typical DBMS has a layered architecture
[Figure: layered DBMS architecture; labels: Web Forms, Application Front Ends, SQL Interface, SQL Commands, DATABASE]
Flink Data Processing System Stack
Source: http://ci.apache.org/projects/flink/flink-docs-release-0.8.1/internal_general_arch.html
– Readings:
– Hellerstein/Stonebraker/Hamilton: “Architecture of a DB System”, Sec 5
– Garcia-Molina/Ullman/Widom, Chapter 13 (skip section 4)
– Ramakrishnan/Gehrke, Chapter 9 (shorter overview in Ch.8)
– Kifer/Bernstein/Lewis, Chapter 9