01 Introduction

DATA3404
Scalable Data Management
A/Prof Uwe Roehm

School of Computer Science
DATA3404 ”Scalable Data Management" - 2023 (Roehm) 1

Scalable Data Management?

Example: Facebook
Source: Facebook
Some Facebook Statistics
In March 2022, Facebook reported
1.96 billion daily active users worldwide.
Supported by a few thousand employees.
Over two years in the mid-2010s:
7x growth in raw user data.
Over Halloween weekend 2011:
1 billion photos were uploaded.
Infrastructure:
– data centers with n x 10,000 servers
– several specialised data stores
sharded MySQL (still?) database for
actual user database
http://en.wikipedia.org/wiki/Facebook/
http://www.socialbakers.com/facebook-statistics/
http://gigaom.com/cloud/facebook-shares-some-secrets-on-making-mysql-scale/
Usage Scenario Facebook (ICDE 2010)
Slide from Facebook presentation at ICDE 2010 (https://www.slideshare.net/ragho/hive-icde-2010)


Big Data Examples
– Customer
– Twitter Live Statistics: https://www.internetlivestats.com/twitter-statistics/
– Walkscore: https://www.walkscore.com
– Neighborland: https://neighborland.com
– Business
– Predictive Marketing: EDITED.com
– Journalism
– TimesMachine: https://timesmachine.nytimes.com/browser
– Panama Papers: https://panamapapers.sueddeutsche.de/en/
– Research
– CERN LHC open data access: https://opendata.cern.ch/
– SDSS SkyServer: https://skyserver.sdss.org/dr18/
– Personalities in the United States (cf Journal of American Psychological As.)
– Human Brain Project: https://www.humanbrainproject.eu
– Google Flu Trends: https://www.google.org/flutrends/about/
– Google Books nGrams: https://books.google.com/ngrams/
The Scale of Today’s Data
– It is common now to deal in units of petabytes (10^15 bytes)
– Some examples from Wikipedia (July 2010)
– Yahoo! claimed record with 1 Petabyte Database – back in 2005
– World of Warcraft uses 1.3 petabytes of storage (year?)
– Teradata database has capacity of 50 petabytes of compressed data
– Large Hadron Collider (LHC) generates 15 petabytes per year.
– Google processes 24 petabytes/day
– Manage and access large data
– Go to seek.com.au and search for the keyword “data architect”, “data
engineer” or “big data” – or “data scientist”
– This course will teach you to be a vendor-neutral “big data architect” /
“data systems engineer”
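To get a feel for these volumes, the per-day and per-year figures quoted above can be converted into sustained data rates. The following is a rough back-of-the-envelope sketch; it assumes decimal petabytes (10^15 bytes), a 365-day year, and a perfectly even load, none of which the sources state.

```python
# Back-of-the-envelope conversion of the quoted volumes into data rates.
# Assumptions (not from the slides): 1 PB = 10**15 bytes, 365-day year,
# and an evenly spread load.
PETABYTE = 10**15                      # bytes
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def bytes_per_second(petabytes, period_seconds):
    """Average rate in bytes/second for a volume spread over a period."""
    return petabytes * PETABYTE / period_seconds


google_rate = bytes_per_second(24, SECONDS_PER_DAY)   # 24 PB per day
lhc_rate = bytes_per_second(15, SECONDS_PER_YEAR)     # 15 PB per year

print(f"Google: about {google_rate / 10**9:.0f} GB/s")
print(f"LHC:    about {lhc_rate / 10**6:.0f} MB/s")
```

Rates of this order are far beyond what a single disk or a single server can sustain, which is one reason the rest of the course looks at partitioning and parallel processing.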
What do we cover in DATA3404?

Grand Theme of DATA3404
– How to efficiently deal with SCALE?
– Large collections of data (hundreds of gigabytes)
• both structured (tuples)
• and unstructured (text or (key,value) pairs)
• we are interested in cases where data does not fit into memory
– Shared access by large numbers of concurrent users (thousands)
– Availability – always ON
– Questions:
– How to efficiently manage large amounts of data?
– How to efficiently find data in those collections?
– How to efficiently serve thousands of concurrent users?

Key Principles
– Data Independence
– Applications are decoupled from structure of data
– Logical data model is decoupled from physical model
– Declarative Interface
– Specify “what rather than how.”
– Separate “interface from implementation.”
– Space can be reused but not time
– Speed up lookups and joins using indexing and copies
– Scale-agnostic Design
– Local processing without global state that can be easily parallelized or
cloned/restarted on new nodes
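The "space for time" principle can be illustrated with a small in-memory sketch. The table, the column names and the dictionary-based index below are hypothetical stand-ins chosen for illustration; a real DBMS would use disk-based structures such as B+-trees or hash indexes.

```python
# Hypothetical in-memory "table": a list of (sid, name) tuples.
students = [(sid, f"student{sid}") for sid in range(100_000)]


def scan_lookup(table, sid):
    """Linear scan: inspect rows one by one until a match is found."""
    for row in table:
        if row[0] == sid:
            return row
    return None


# The index is an extra copy of the key column that maps each key to the
# position of its row. It costs additional space but avoids the scan.
index = {row[0]: pos for pos, row in enumerate(students)}


def index_lookup(table, idx, sid):
    """Index lookup: one dictionary probe, then a direct row access."""
    pos = idx.get(sid)
    return table[pos] if pos is not None else None
```

Both functions return the same row; the difference is that the scan touches up to every row, while the indexed lookup touches one. The price is the memory for the index and the work of keeping it up to date when rows change.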
What is a Data Management System?
– A Database is a collection of data central to some organisation or
enterprise
– Essential to operation of enterprise
– State of database mirrors state of enterprise
– An important asset on its own
– A Database Management System (DBMS) is a software package that
manages a database:
– Stores the database on some mass storage (+backup +recovery).
– Supports a high-level access language (e.g. SQL).
– Application describes database accesses using that language.
– DBMS interprets statements of language to perform requested database access.

Levels of Abstraction
– Many views, single conceptual (logical) schema and physical schema.
– Views describe how users see the data.
– Conceptual schema defines logical structure.
– Physical schema describes the files and indexes used.
[Figure: View 1, View 2 and View 3 on top of the Conceptual Schema, which sits on top of the Physical Schema]
– In DATA3404, we will look at
– the physical layer
– the translation between conceptual and
physical schema, and
– cross-layer aspects of DBMS.
Knowledge of the database internals is the pre-requisite for effective performance tuning.
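As a small illustration of the view level, the sketch below defines a view on top of a base table using Python's built-in sqlite3 module. The table Students, its gpa column, the view name StudentNames and the sample rows are all hypothetical and only serve the example.

```python
import sqlite3

# In-memory SQLite database; schema and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Conceptual schema: the base table.
    CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT, gpa REAL);
    INSERT INTO Students VALUES (1, 'Alice', 3.9), (2, 'Bob', 2.8);

    -- View: one way a group of users sees the data (here without gpa).
    CREATE VIEW StudentNames AS SELECT sid, name FROM Students;
    """
)

# Users of the view query it like a table; they never see the gpa column.
view_rows = conn.execute("SELECT * FROM StudentNames ORDER BY sid").fetchall()
print(view_rows)
```

How the rows of Students are laid out in files and pages is the physical schema; nothing in the view definition or in the query refers to it.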
Data Independence
– Applications are insulated from how data is structured and stored.
– Logical data independence:
Protection from changes in logical (DDL) structure of data.
– Physical data independence:
Protection from changes in physical structure of data (storage system).
=> One of the most important benefits of using a DBMS!
– DATA3404:
– Which physical design choices do we have available?
– What are the advantages / disadvantages of each structure?
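Physical data independence can be demonstrated in a few lines: the physical design changes, the query text does not. The sketch below uses Python's built-in sqlite3 module; the sample rows, the course code COMP0000 and the index name enrolled_cid are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Enrolled (sid INTEGER, cid TEXT)")
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?)",
    [(1, "DATA3404"), (2, "COMP0000"), (3, "DATA3404")],  # made-up rows
)

# The application's query; it mentions no files, pages or indexes.
QUERY = "SELECT sid FROM Enrolled WHERE cid = 'DATA3404' ORDER BY sid"

before = conn.execute(QUERY).fetchall()

# Change the physical design: add an index on cid.
conn.execute("CREATE INDEX enrolled_cid ON Enrolled (cid)")

# Same query text, same answer; only the access path may differ.
after = conn.execute(QUERY).fetchall()
print(before, after)
```

The application code stays untouched while the DBMS is free to pick a different access path, which is exactly the kind of physical design choice this course examines.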

The Nature of SQL
– In the “programming world” new languages are constantly being invented –
Java, C#, Python, Ruby, Go etc.
– In the “database world” attempts to replace SQL with other languages have
consistently failed – SQL ends up being extended.
– Why is SQL so versatile? => The declarative nature of SQL

– In modern parlance SQL was probably the first widely accepted language to
successfully decouple the interface design from the implementation.
– SQL syntax is “syntactic sugar” for relational language.
(“most practical thing is to have a good theory”)

SQL Example
– “Find all students that are enrolled in DATA3404.”
SELECT s.sid, s.name
FROM Students s, Enrolled e
WHERE s.sid = e.sid AND e.cid = 'DATA3404';
– Given this formal ‘description’ of what we want, how does a DBMS
efficiently find the information?
– Is there an index? Shall we use it? How to join the two tables?
– Query Processing:
– Translation into internal representation
– Query optimization
– Query execution
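The query from this slide can be tried out with Python's built-in sqlite3 module, which also lets us peek at the plan the engine chose. This is only a sketch: the sample rows and the course code COMP0000 are invented, an ORDER BY is added to make the output deterministic, and EXPLAIN QUERY PLAN is SQLite-specific with output that varies between versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Enrolled (sid INTEGER, cid TEXT);
    INSERT INTO Students VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO Enrolled VALUES (1, 'DATA3404'), (2, 'COMP0000'), (3, 'DATA3404');
    """
)

QUERY = """
SELECT s.sid, s.name
FROM Students s, Enrolled e
WHERE s.sid = e.sid AND e.cid = 'DATA3404'
ORDER BY s.sid
"""

# Query execution: the declarative statement is turned into a result.
result = conn.execute(QUERY).fetchall()
print(result)

# Ask SQLite which plan its optimizer picked (wording is version-dependent).
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
for step in plan:
    print(step[-1])  # last column holds the human-readable description
```

The query says nothing about join order or index usage; those decisions appear only in the plan output.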

Query Optimization
Are they all doing the same? Which plan is better?
[Figure: three alternative query plans over students and enrolled, each combining project sid,name, join, and select cid='DATA3404' in a different order]
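The question of whether different plans are equivalent, and which is cheaper, can be explored with a toy model. The sketch below implements select, project and a nested-loop join over lists of dictionaries and compares two operator orders. The data, the course code COMP0000 and the function names are made up, and the two plans shown are not necessarily the ones drawn on the slide.

```python
# Made-up sample relations.
students = [{"sid": i, "name": f"s{i}"} for i in range(1, 6)]
enrolled = [
    {"sid": 1, "cid": "DATA3404"},
    {"sid": 2, "cid": "DATA3404"},
    {"sid": 2, "cid": "COMP0000"},
    {"sid": 3, "cid": "COMP0000"},
]


def join(left, right):
    """Nested-loop equi-join on sid."""
    return [{**l, **r} for l in left for r in right if l["sid"] == r["sid"]]


def select(rows, cid):
    """Keep only the rows for the given course."""
    return [r for r in rows if r["cid"] == cid]


def project(rows):
    """Keep only sid and name (sorted for easy comparison)."""
    return sorted((r["sid"], r["name"]) for r in rows)


# Plan A: join everything first, then filter.
joined = join(students, enrolled)            # intermediate result: 4 rows
plan_a = project(select(joined, "DATA3404"))

# Plan B: filter enrolled first, then join the smaller input.
filtered = select(enrolled, "DATA3404")      # intermediate result: 2 rows
plan_b = project(join(students, filtered))

print(plan_a == plan_b, len(joined), len(filtered))
```

Both plans give the same answer, but Plan B feeds fewer rows into the join. Estimating such intermediate sizes and picking the cheaper order is the job of the query optimizer.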
Challenge: Multi-Core CPUs
– For example, a recent rack server (2U):
– up to 2 x 18-core Intel® Xeon® CPUs
– up to 1.5 TB RAM
– up to 16 TB SSDs / HDDs
– 4 x Gigabit Ethernet
– optional 2 x 10 Gigabit Ethernet
– Servers with 64 cores and more are available already

– These systems can execute multiple processes and queries in parallel on
different CPU cores
– But we have to synchronize them when they access shared data!
– The challenge: How to do this fast and correctly at the same time?
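The synchronization problem can be sketched with a few threads that update one shared value. This is an analogy using Python threads and a hypothetical account dictionary, not how a DBMS implements concurrency control (that is the role of the lock and transaction managers covered later).

```python
import threading

# Shared data, accessed by several workers at once (hypothetical example).
account = {"balance": 0}
lock = threading.Lock()


def deposit(times):
    """Add 1 to the shared balance `times` times."""
    for _ in range(times):
        # The read-modify-write below is not atomic; the lock ensures that
        # only one thread performs it at a time, so no update is lost.
        with lock:
            account["balance"] += 1


threads = [threading.Thread(target=deposit, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(account["balance"])
```

The lock guarantees correctness but serializes the updates, which is the tension named above: the more often workers have to wait for each other, the less benefit the extra cores bring.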
Scale-Up
– The traditional approach:
– To scale with increasing load,
buy more powerful, larger
hardware
• from single workstation
• to dedicated db server
• to large massive-parallel
database appliance
[Source: Jim Gray, HPTS99]
The Alternative: Scale-Out
A single server has limits…
For real Big Data processing, we need to
scale out to a cluster of multiple servers (nodes):
[Source: Server.png from PinClipart.com]

State-of-the-Art:
shared-nothing architecture
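In a shared-nothing cluster each node owns its own slice of the data, so the system needs a rule that maps every key to exactly one node. A minimal sketch of hash partitioning follows; the node count, the key format and the use of SHA-256 are assumptions for illustration, and real systems add replication and rebalancing on top.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size

# Each "node" is modelled as a private dictionary; nodes share nothing.
nodes = {n: {} for n in range(NUM_NODES)}


def node_for(key, num_nodes=NUM_NODES):
    """Map a key to a node id using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def put(key, value):
    """Store the value on the single node responsible for the key."""
    nodes[node_for(key)][key] = value


def get(key):
    """Look the key up on the one node that can hold it."""
    return nodes[node_for(key)].get(key)


put("user:42", "Alice")
print(node_for("user:42"), get("user:42"))
```

A stable hash is used instead of Python's built-in hash(), whose result for strings changes between interpreter runs. Because a lookup only involves the responsible node, adding nodes adds capacity without a central bottleneck.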

Infrastructure of Scale
– Typical ‘cloud-size’ data center will have about 40-50k servers
– ‘commodity’ design with multiple cores
– virtualization to host multiple services on same shared hardware
=> Plunging cost of computing
– Multiple datacenters
– At scale multiple datacenters can be used
• Close to customer
• Cross data center data redundancy
• Address international markets efficiently
– Avoid massive upfront data cost & years to fully utilize

– Scale supports pervasive automation investment
Data Science Platforms – Big Picture
– Layered stack of frameworks for distributed data management and processing
– Many choices of distributed data processing platforms
[Figure: layered stack, top to bottom: Application; Data Processing (e.g. MapReduce); Storage; Infrastructure. Slide by Ion Stoica, UCB, 2013]
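MapReduce, named in the data processing layer, structures a job as a map phase, a grouping (shuffle) step and a reduce phase. The following single-process word count is a sketch of that pattern only; the function names and sample documents are made up, and a real framework would run the phases in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key, here by summing them."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data", "data management", "scalable data management"]

# Each document could be mapped on a different node; here we loop locally.
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

The map and reduce functions keep no global state, so they can be run, cloned or restarted on any node, in line with the scale-agnostic design principle from earlier in this lecture.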
Internal Structure of a DBMS
– A typical DBMS has a layered architecture
– This is one of several possible architectures; each system has its own variations
– Vertical functions
– concurrency control
– recovery
[Figure: DBMS layers, top to bottom: Web Forms, Application Front Ends and SQL Interface; SQL Commands; Query Evaluation Engine (Parser, Optimizer, Plan Executor, Operator Evaluator); File and Access Methods; Buffer Manager; Disk Space Manager; Database. Alongside these layers: Transaction Manager and Lock Manager (Concurrency Control) and Recovery Manager]
Flink Data Processing System Stack
Source: http://ci.apache.org/projects/flink/flink-docs-release-0.8.1/internal_general_arch.html

Summary
– DBMS used to maintain & query large datasets
– Main Benefits:
– Program-Data Independence
– Controlled Data Redundancy
– Declarative Queries
– Data Science Platforms borrow core techniques from DBMS
– Storage, querying, distributed data management and processing
– DBMS techniques not only of interest to DBAs,
but for every SW developer facing large-scale data problems

Next Week
– Storage Layer
– Disks, Blocks and Files
– Buffer Management
– Row and page structures
– Data Compression
– “Row Store” vs. “Column Store”
– Readings:
– Hellerstein/Stonebraker/Hamilton:“Architecture of a DB System”, Sec 5
– Garcia-Molina/Ullman/Widom, Chapter 13 (skip section 4)
– Ramakrishnan/Gehrke, Chapter 9 (shorter overview in Ch.8)
– Kifer/Bernstein/Lewis, Chapter 9
