
Architecture and Implementation of Database Systems (Fall 2009)

Jens Teubner, Systems Group, Department of Computer Science, ETH Zürich
jens.teubner@inf.ethz.ch

Part VII: Databases on Modern Hardware


Motivation

The techniques we've seen so far all build on the same assumptions:

  Query processing cost is dominated by disk I/O.
  Main memory is random-access memory.
  Access to main memory has negligible cost.

Are these assumptions justified at all?


Motivation
Let's have a look at a real, large-scale database: Amadeus IT Group is a major provider of travel-related IT. Its core database is the Global Distribution System (GDS):

  dozens of millions of flight bookings
  a few kilobytes per booking
  several hundred gigabytes of data

These numbers may sound impressive, but:

  The hot set of this database is significantly smaller.
  Flights with near departure times are most interesting.
  My laptop already has four gigabytes of RAM.

It is perfectly realistic to have the hot set in main memory.

Row-Wise Storage

Remember the row-wise data layout we discussed in Chapter I: the records ⟨a1, b1, c1, d1⟩, ⟨a2, b2, c2, d2⟩, ⟨a3, b3, c3, d3⟩, and ⟨a4, b4, c4, d4⟩ are each stored contiguously, one record after the other, filling page 0 and then page 1.

[Figure: row-wise storage, with complete records placed one after another across pages 0 and 1]

Records in the Amadeus ITINERARY table are about 350 bytes wide, spanning 47 attributes (i.e., 10–30 records per page).


Row-Wise Storage
To answer a query like

  SELECT *
  FROM ITINERARY
  WHERE FLIGHTNO = 'LX7' AND CLASS = 'M'

the system has to scan the entire ITINERARY table.18

  The table probably won't fit into main memory as a whole.
  Though we always have to fetch full pages from disk, we will only inspect 20–60 data items per page (to decide the predicate).
18 assuming there is no index support
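A minimal sketch of such a scan, assuming fixed-size records and hypothetical byte offsets for the two attributes (the constants below are illustrative, not Amadeus' actual layout):

    #include <stdio.h>
    #include <string.h>

    #define RECORD_SIZE   350                        /* approx. bytes per ITINERARY record */
    #define PAGE_SIZE    8192                        /* assumed page size                  */
    #define RECS_PER_PAGE (PAGE_SIZE / RECORD_SIZE)

    #define FLIGHTNO_OFF  100                        /* hypothetical offsets of FLIGHTNO   */
    #define CLASS_OFF     110                        /* and CLASS inside a record          */

    /* Scan one page: only FLIGHTNO and CLASS of each record are inspected,
     * yet the whole page (all 47 attributes) had to be fetched from disk.   */
    static void scan_page(const char *page)
    {
        for (int r = 0; r < RECS_PER_PAGE; r++) {
            const char *rec = page + r * RECORD_SIZE;
            if (strncmp(rec + FLIGHTNO_OFF, "LX7", 3) == 0 && rec[CLASS_OFF] == 'M')
                printf("match: record %d of this page\n", r);
        }
    }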



Column-Wise Storage
Compare this to column-wise storage of the same records ⟨a1, b1, c1, d1⟩ … ⟨a4, b4, c4, d4⟩: each page now holds the values of a single attribute, e.g., page 0 stores a1 a2 a3 a4 and page 1 stores b1 b2 b3 b4.

[Figure: column-wise storage, one attribute per page]

We now have to evaluate the query in two steps:

  1. Scan the pages that contain the FLIGHTNO and CLASS attributes.
  2. For each matching tuple, fetch the 45 missing attributes from the remaining data pages.
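A minimal sketch of these two steps over in-memory column arrays (the column names, types, and value encodings are assumptions of this example):

    #include <stddef.h>

    /* Step 1: scan the two predicate columns and collect matching positions. */
    static size_t select_matches(const int *flightno, const char *class_,
                                 size_t n, size_t *pos_out)
    {
        size_t m = 0;
        for (size_t i = 0; i < n; i++)
            if (flightno[i] == 7 /* encodes 'LX7' */ && class_[i] == 'M')
                pos_out[m++] = i;
        return m;
    }

    /* Step 2: fetch one of the 45 remaining columns by position. */
    static void fetch_column(const double *price, const size_t *pos,
                             size_t m, double *out)
    {
        for (size_t j = 0; j < m; j++)
            out[j] = price[pos[j]];       /* positional lookup, no search needed */
    }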


Column-Wise Storage
  We read only a subset of the table, which may now fit into memory.
  We actually use hundreds or thousands of data items per page.

But: We have to re-construct each tuple from 45 different pages.

Column-wise storage particularly pays off if

  tables are wide (i.e., contain many columns),
  there is no index support (in high-dimensional spaces, e.g., indexes become ineffective; see Chapter III), and
  queries have a high selectivity.

OLAP workloads are the prototypical use case.

Example: MonetDB
The open-source database MonetDB19 pushes the idea of vertical decomposition to its extreme: All tables (binary association tables, BATs) have 2 columns.
  OID | ID   | NAME  | SEX           OID | ID        OID | NAME       OID | SEX
  ----+------+-------+----           ----+------     ----+-------     ----+----
   0  | 4711 | John  |  M      =>     0  | 4711       0  | John        0  |  M
   1  | 1723 | Marc  |  M             1  | 1723       1  | Marc        1  |  M
   2  | 6381 | Betty |  F             2  | 6381       2  | Betty       2  |  F

Columns that carry consecutive numbers (such as OID above) can be represented as virtual columns. They are only stored implicitly (given by the tuple order). This reduces space consumption and allows positional lookups.
19 http://www.monetdb.org/
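A rough sketch of the idea (not MonetDB's actual data structures): a BAT whose OID column is virtual is just a dense array, so a lookup by OID is plain pointer arithmetic.

    #include <stddef.h>

    /* A BAT with a virtual head (OID) column: OIDs are seqbase, seqbase+1, ... */
    typedef struct {
        size_t seqbase;   /* OID of the first tuple                    */
        size_t count;     /* number of tuples                          */
        int   *tail;      /* tail column, stored as a contiguous array */
    } bat_int;

    /* Positional lookup: no search, just index into the tail array. */
    static int bat_fetch(const bat_int *b, size_t oid)
    {
        return b->tail[oid - b->seqbase];
    }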

Reduced Memory Footprint

With the help of column-wise storage, the hot set of the database may better fit into main memory. In addition, column-wise storage increases the effectiveness of compression:

  All values within a page belong to the same domain.
  There's a high chance of redundancy in such pages (see the run-length encoding sketch below).

So, with all data in main memory, are we done already?
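As a toy illustration (run-length encoding is only one of several schemes column stores use; the code is not any particular system's format), a column page whose neighboring values often repeat compresses into a short list of runs:

    #include <stddef.h>

    typedef struct { char value; size_t length; } rle_run;

    /* Run-length encode one column page of single-character values;
     * returns the number of runs written to `out`.                   */
    static size_t rle_encode(const char *col, size_t n, rle_run *out)
    {
        size_t runs = 0;
        for (size_t i = 0; i < n; ) {
            size_t j = i + 1;
            while (j < n && col[j] == col[i])
                j++;                          /* extend the current run */
            out[runs].value  = col[i];
            out[runs].length = j - i;
            runs++;
            i = j;
        }
        return runs;
    }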



Memory Access Cost

Comparing random and sequential access on disk, SSD, and main memory:

  Random, disk            316 values/sec
  Sequential, disk       53.2M values/sec
  Random, SSD            1,924 values/sec
  Sequential, SSD        42.2M values/sec
  Random, memory         36.7M values/sec
  Sequential, memory    358.2M values/sec

(Disk tests were carried out on a freshly booted machine, a Windows 2003 server with 64 GB RAM and eight 15,000 RPM SAS disks in RAID 5 configuration, to eliminate the effect of operating-system disk caching. The SSD test used a latest-generation Intel high-performance SATA SSD.)

A. Jacobs. The Pathologies of Big Data. Comm. of the ACM, 52(8), Aug. 2009.


Main Memory Access Cost


A simple strided-access experiment (measured with Calibrator v0.9e, Stefan.Manegold@cwi.nl, on a Pentium M 1700 with 32 kB L1 and 2 MB L2 cache) traverses an array with a varying stride over a varying memory range:

    int data[arr_size];

    for (int i = arr_size - 1; i >= 0; i -= stride)
        process (data[i]);

[Figure: latency per iteration (nanoseconds / CPU cycles) as a function of the memory range (1 kB to 256 MB) for strides of 4 to 256 bytes; latency jumps once the range exceeds a cache size]

  Memory access incurs a significant latency (209 CPU cycles here).
  (Multiple levels of) caches try to hide this latency.
  Latency (measured in CPU cycles) keeps increasing over time.
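A self-contained version of such a micro-benchmark might look as follows (a sketch; the array size, stride, and timing method are choices of this example, not of the Calibrator tool):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        /* memory range and stride, both given in number of ints */
        size_t arr_size = (argc > 1) ? strtoul(argv[1], NULL, 10) : 16 * 1024 * 1024;
        size_t stride   = (argc > 2) ? strtoul(argv[2], NULL, 10) : 16;

        int *data = calloc(arr_size, sizeof(int));
        if (data == NULL)
            return 1;

        volatile long sum = 0;             /* keep the loop from being optimized away */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (long i = (long) arr_size - 1; i >= 0; i -= (long) stride)
            sum += data[i];                /* stands in for process(data[i]) */

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns    = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        double iters = (double) arr_size / (double) stride;
        printf("%.1f ns per iteration\n", ns / iters);

        free(data);
        return 0;
    }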

Memory Access Cost


Various caches lead to the situation that RAM is not random-access in today's systems:

  multi-level data caches (Intel x86: two levels20, AMD: three levels),
  instruction caches,
  translation lookaside buffers (TLBs) to speed up virtual address translation.

Novel database systems (sometimes called main-memory databases) include algorithms that are optimized for in-memory processing. To keep matters simple, they assume that all data always resides in main memory.
20 The new i7 processor line has an L3 cache, too.



Optimizing for Cache Efficiency


To access main memory, CPU caches, in a sense, play the role that the buffer manager played to access the disk. Use the same tricks to make good use of the caches (see the sketch below):

  Data processing in blocks: choose the block size to match the cache size now.
  Sequential access: there is explicit hardware support for sequential scans; use prefetching if possible (e.g., the x86 prefetchnta assembly instruction).
  What the page size was in the buffer manager, the cache line size is in the CPU cache (e.g., 64 bytes).
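A small sketch of both tricks; the block size, the prefetch distance, and the use of the SSE intrinsic (which emits prefetchnta) are assumptions of this example:

    #include <stddef.h>
    #include <xmmintrin.h>                 /* _mm_prefetch, _MM_HINT_NTA (x86 SSE)   */

    #define CACHE_LINE 64                                  /* bytes per cache line    */
    #define BLOCK_INTS (64 * 1024 / sizeof(int))           /* roughly L1-sized blocks */
    #define LINE_INTS  (CACHE_LINE / sizeof(int))

    /* Process one cache-friendly block sequentially, prefetching ahead. */
    static long process_block(const int *block, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            size_t ahead = i + 4 * LINE_INTS;      /* a few cache lines ahead        */
            if (i % LINE_INTS == 0 && ahead < n)   /* once per line: hint the hardware */
                _mm_prefetch((const char *) &block[ahead], _MM_HINT_NTA);
            sum += block[i];
        }
        return sum;
    }

    /* Process the whole input block by block. */
    long process_all(const int *data, size_t n)
    {
        long sum = 0;
        for (size_t off = 0; off < n; off += BLOCK_INTS) {
            size_t len = (n - off < BLOCK_INTS) ? n - off : BLOCK_INTS;
            sum += process_block(data + off, len);
        }
        return sum;
    }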


In-Memory Hash Join


Straightforward clustering (scan the input relation and write each tuple into one of H different clusters) may cause problems:

  If H exceeds the number of TLB entries, clustering will thrash the TLB.
  If H exceeds the number of cache lines, cache thrashing occurs.

How could we avoid these problems?


Radix Clustering
Radix clustering performs the clustering in multiple passes, each pass looking at only a few bits of the hash value. In the example below, pass 1 (h1) partitions on the two most significant bits; pass 2 (h2) then refines each cluster on the remaining bit:

  pass 1: h1, 2 bits            pass 2: h2, 1 bit

  57 001         57 001         96 000
  17 001         17 001         57 001
  03 011         81 001         17 001
  47 111         96 000         81 001
  92 100         75 001         75 001
  81 001    =>   03 011    =>   66 010
  20 100         66 010         03 011
  06 110         92 100         92 100
  96 000         20 100         20 100
  37 101         37 101         37 101
  66 010         47 111         06 110
  75 001         06 110         47 111

h1 and h2 are the same hash function, but they look at different bits in the generated hash.
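A sketch of a single radix pass over (value, hash) pairs; the out-of-place counting implementation and the field names are illustrative, not the original authors' code:

    #include <stdlib.h>

    typedef struct { int value; unsigned hash; } tuple_t;

    /* One radix pass: partition `in` into `out` on `bits` bits of the hash,
     * starting at bit position `shift` (counted from the least significant bit). */
    static void radix_pass(const tuple_t *in, tuple_t *out, size_t n,
                           unsigned shift, unsigned bits)
    {
        size_t   nclusters = (size_t) 1 << bits;
        unsigned mask      = (unsigned) (nclusters - 1);
        size_t  *count     = calloc(nclusters, sizeof(size_t));
        size_t  *cursor    = calloc(nclusters, sizeof(size_t));

        if (count == NULL || cursor == NULL) {
            free(count); free(cursor);
            return;
        }

        /* 1. histogram: how many tuples fall into each cluster?   */
        for (size_t i = 0; i < n; i++)
            count[(in[i].hash >> shift) & mask]++;

        /* 2. prefix sum: starting offset of each cluster in `out` */
        for (size_t c = 1; c < nclusters; c++)
            cursor[c] = cursor[c - 1] + count[c - 1];

        /* 3. scatter: copy every tuple to its cluster              */
        for (size_t i = 0; i < n; i++)
            out[cursor[(in[i].hash >> shift) & mask]++] = in[i];

        free(count);
        free(cursor);
    }

In the example above, radix_pass(in, out, n, 1, 2) plays the role of h1, and a second call with shift 0 and one bit, applied to each cluster separately, plays the role of h2. Using only a few bits per pass keeps the number of simultaneously written clusters below the number of TLB entries and cache lines.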

Radix Clustering

[Figure: execution time breakdown (seconds and CPU clocks, split into TLB misses, L1 data misses, L2 data misses, and remaining CPU cost) of one-pass clustering, multi-pass clustering, and optimized multi-pass clustering as a function of the number of radix bits, measured on an SGI Origin2000 (250 MHz, 32 kB L1, 4 MB L2), a Sun Ultra, and an Intel PC, at a cardinality of 8M tuples. Vertical grid lines indicate where the number of clusters created equals the number of TLB entries, L1 cache lines, or L2 cache lines, respectively.]

S. Manegold, P. Boncz, and M. Kersten. Optimizing Main-Memory Join on Modern Hardware. IEEE TKDE, vol. 14(4), Jul/Aug 2002.

Optimizing Instruction Cache Usage


Consider a query processor that uses tuple-wise pipelining: each tuple is passed through the pipeline (operators A, B, and C) before we process the next one. For eight tuples we obtain an execution trace ABCABCABCABCABCABCABCABC, where A, B, and C correspond to the code that implements the three operators.

Depending on the size of the code that implements A, B, and C, this can mean instruction cache thrashing.


Optimizing Instruction Cache Usage


We can improve the effect of instruction caching if we do the pipelining in larger chunks, e.g., four tuples at a time: AAAABBBBCCCCAAAABBBBCCCC. Three out of four executions of every operator will now find their instructions cached21 (see the sketch below). MonetDB again pushes this idea to the extreme: full tables are processed at once (full materialization).

 What do you think about this approach?


21 This assumes that A, B, and C fit into the instruction cache individually. A variation is to group operators, such that the code for each group fits into cache.
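A minimal sketch of chunk-wise ("vectorized") execution; the toy operators and the chunk size of four tuples are illustrative only:

    #include <stddef.h>

    #define CHUNK 4   /* tuples passed through each operator per call (illustrative) */

    /* Three toy operators A, B, and C; each processes a whole chunk per call,
     * so its instructions stay hot in the instruction cache while it runs.    */
    static void op_A(int *t, size_t n) { for (size_t i = 0; i < n; i++) t[i] += 1; }
    static void op_B(int *t, size_t n) { for (size_t i = 0; i < n; i++) t[i] *= 2; }
    static void op_C(int *t, size_t n) { for (size_t i = 0; i < n; i++) t[i] -= 3; }

    /* Chunk-wise pipelining: the trace becomes AAAABBBBCCCC AAAABBBBCCCC ... */
    void run_pipeline(int *tuples, size_t n)
    {
        for (size_t off = 0; off < n; off += CHUNK) {
            size_t len = (n - off < CHUNK) ? n - off : CHUNK;
            op_A(tuples + off, len);
            op_B(tuples + off, len);
            op_C(tuples + off, len);
        }
    }

Setting CHUNK to the full table size corresponds to MonetDB's full materialization.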

New Classes of Hardware for Data Processing


Our group is actively working on the use of non-traditional hardware for database processing.

Field-Programmable Gate Arrays (FPGAs): programmable hardware; implement database queries directly in hardware.
http://www.systems.ethz.ch/research/projects/avalanche

Remote Direct Memory Access (RDMA): hardware-accelerated network processing; used for distributed database systems.
http://www.systems.ethz.ch/research/projects/data-cyclotron


Field-Programmable Gate Arrays


  An array of logic gates
  Functionality fully programmable
  Re-programmable after deployment ("in the field"): programmable hardware

FPGAs can be configured to implement any logic circuit. Complexity is bound by the available chip space.


From Query to Hardware Circuit


Glacier compiles queries into hardware circuits. For example, the query

  SELECT Price, Volume
  FROM   Trades
  WHERE  Symbol = "UBSN" AND Volume > 100000

is translated into a circuit over the Trades input stream: tuples arrive as parallel wires (the payload) plus a data valid flag; one comparator (=) checks Symbol = "UBSN", another (<) checks 100000 < Volume, an AND gate (&) combines the two predicates, and qualifying tuples are projected onto Price and Volume.

[Figure: the generated circuit, built from registers and logic gates]


Use Case: Algorithmic Trading


Real-world use case: algorithmic trading. The challenge: sustain high packet rates, yet achieve low latency (microsecond range).

[Figure: fraction of packets processed at data input rates of 300,000 and 1,000,000 pkts/s, FPGA vs. software (Linux 2.6); the FPGA sustains 100 % at both rates, while the software solution processes only about 60 % and 36 %, respectively]

  Process data at wire speed.
  Shield the CPU from high load (about 90 % of the data is filtered out).

Join Processing in Data Cyclotron


[Figure: six hosts H0 through H5 arranged in a ring; each host Hi holds a partition Si of one join relation, while the chunks R0 through R5 of the other relation rotate around the ring from host to host via RDMA]

RDMA: join and rotate in parallel.
