
Architecture and Implementation of Database Systems (Fall 2009)

Jens Teubner, Systems Group, Department of Computer Science, ETH Zürich
jens.teubner@inf.ethz.ch

Part VII: Databases on Modern Hardware


Motivation

The techniques we've seen so far all build on the same assumptions:

  Query processing cost is dominated by disk I/O.
  Main memory is random-access memory.
  Access to main memory has negligible cost.

Are these assumptions justified at all?


Motivation
Let's have a look at a real, large-scale database: Amadeus IT Group is a major provider of travel-related IT. Its core database is the Global Distribution System (GDS):

  dozens of millions of flight bookings
  a few kilobytes per booking
  several hundred gigabytes of data

These numbers may sound impressive, but:

  The hot set of this database is significantly smaller.
  Flights with near departure times are most interesting.
  My laptop already has four gigabytes of RAM.

It is perfectly realistic to have the hot set in main memory.

Row-Wise Storage

Remember the row-wise data layout we discussed in Chapter I: the records ⟨a1, b1, c1, d1⟩, ⟨a2, b2, c2, d2⟩, ⟨a3, b3, c3, d3⟩, and ⟨a4, b4, c4, d4⟩ are each stored contiguously, one record after the other, filling page 0 and then page 1.

[Figure: row-wise storage, with complete records placed one after another across pages 0 and 1]

Records in the Amadeus ITINERARY table are about 350 bytes wide, spanning 47 attributes (i.e., 10–30 records per page).


Row-Wise Storage
To answer a query like

  SELECT *
  FROM ITINERARY
  WHERE FLIGHTNO = 'LX7' AND CLASS = 'M'

the system has to scan the entire ITINERARY table.18

  The table probably won't fit into main memory as a whole.
  Though we always have to fetch full pages from disk, we will only inspect 20–60 data items per page (to decide the predicate).
18 assuming there is no index support
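A minimal sketch of such a scan, assuming fixed-size records and hypothetical byte offsets for the two attributes (the constants below are illustrative, not Amadeus' actual layout):

    #include <stdio.h>
    #include <string.h>

    #define RECORD_SIZE   350                        /* approx. bytes per ITINERARY record */
    #define PAGE_SIZE    8192                        /* assumed page size                  */
    #define RECS_PER_PAGE (PAGE_SIZE / RECORD_SIZE)

    #define FLIGHTNO_OFF  100                        /* hypothetical offsets of FLIGHTNO   */
    #define CLASS_OFF     110                        /* and CLASS inside a record          */

    /* Scan one page: only FLIGHTNO and CLASS of each record are inspected,
     * yet the whole page (all 47 attributes) had to be fetched from disk.   */
    static void scan_page(const char *page)
    {
        for (int r = 0; r < RECS_PER_PAGE; r++) {
            const char *rec = page + r * RECORD_SIZE;
            if (strncmp(rec + FLIGHTNO_OFF, "LX7", 3) == 0 && rec[CLASS_OFF] == 'M')
                printf("match: record %d of this page\n", r);
        }
    }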



Column-Wise Storage
Compare this to column-wise storage of the same records ⟨a1, b1, c1, d1⟩ … ⟨a4, b4, c4, d4⟩: each page now holds the values of a single attribute, e.g., page 0 stores a1 a2 a3 a4 and page 1 stores b1 b2 b3 b4.

[Figure: column-wise storage, one attribute per page]

We now have to evaluate the query in two steps:

  1. Scan the pages that contain the FLIGHTNO and CLASS attributes.
  2. For each matching tuple, fetch the 45 missing attributes from the remaining data pages.
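A minimal sketch of these two steps over in-memory column arrays (the column names, types, and value encodings are assumptions of this example):

    #include <stddef.h>

    /* Step 1: scan the two predicate columns and collect matching positions. */
    static size_t select_matches(const int *flightno, const char *class_,
                                 size_t n, size_t *pos_out)
    {
        size_t m = 0;
        for (size_t i = 0; i < n; i++)
            if (flightno[i] == 7 /* encodes 'LX7' */ && class_[i] == 'M')
                pos_out[m++] = i;
        return m;
    }

    /* Step 2: fetch one of the 45 remaining columns by position. */
    static void fetch_column(const double *price, const size_t *pos,
                             size_t m, double *out)
    {
        for (size_t j = 0; j < m; j++)
            out[j] = price[pos[j]];       /* positional lookup, no search needed */
    }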


Column-Wise Storage
  We read only a subset of the table, which may now fit into memory.
  We actually use hundreds or thousands of data items per page.

But: We have to re-construct each tuple from 45 different pages.

Column-wise storage particularly pays off if

  tables are wide (i.e., contain many columns),
  there is no index support (in high-dimensional spaces, e.g., indexes become ineffective; see Chapter III), and
  queries have a high selectivity.

OLAP workloads are the prototypical use case.

Example: MonetDB
The open-source database MonetDB19 pushes the idea of vertical decomposition to its extreme: All tables (binary association tables, BATs) have 2 columns.
  OID | ID   | NAME  | SEX           OID | ID        OID | NAME       OID | SEX
  ----+------+-------+----           ----+------     ----+-------     ----+----
   0  | 4711 | John  |  M      =>     0  | 4711       0  | John        0  |  M
   1  | 1723 | Marc  |  M             1  | 1723       1  | Marc        1  |  M
   2  | 6381 | Betty |  F             2  | 6381       2  | Betty       2  |  F

Columns that carry consecutive numbers (such as OID above) can be represented as virtual columns. They are only stored implicitly (given by the tuple order). This reduces space consumption and allows positional lookups.
19 http://www.monetdb.org/
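A rough sketch of the idea (not MonetDB's actual data structures): a BAT whose OID column is virtual is just a dense array, so a lookup by OID is plain pointer arithmetic.

    #include <stddef.h>

    /* A BAT with a virtual head (OID) column: OIDs are seqbase, seqbase+1, ... */
    typedef struct {
        size_t seqbase;   /* OID of the first tuple                    */
        size_t count;     /* number of tuples                          */
        int   *tail;      /* tail column, stored as a contiguous array */
    } bat_int;

    /* Positional lookup: no search, just index into the tail array. */
    static int bat_fetch(const bat_int *b, size_t oid)
    {
        return b->tail[oid - b->seqbase];
    }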

Reduced Memory Footprint

With the help of column-wise storage, the hot set of the database may better fit into main memory. In addition, column-wise storage increases the effectiveness of compression:

  All values within a page belong to the same domain.
  There's a high chance of redundancy in such pages (see the run-length encoding sketch below).

So, with all data in main memory, are we done already?
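As a toy illustration (run-length encoding is only one of several schemes column stores use; the code is not any particular system's format), a column page whose neighboring values often repeat compresses into a short list of runs:

    #include <stddef.h>

    typedef struct { char value; size_t length; } rle_run;

    /* Run-length encode one column page of single-character values;
     * returns the number of runs written to `out`.                   */
    static size_t rle_encode(const char *col, size_t n, rle_run *out)
    {
        size_t runs = 0;
        for (size_t i = 0; i < n; ) {
            size_t j = i + 1;
            while (j < n && col[j] == col[i])
                j++;                          /* extend the current run */
            out[runs].value  = col[i];
            out[runs].length = j - i;
            runs++;
            i = j;
        }
        return runs;
    }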



Memory Access Cost

Comparing random and sequential access on disk, SSD, and main memory:

  Random, disk            316 values/sec
  Sequential, disk       53.2M values/sec
  Random, SSD            1,924 values/sec
  Sequential, SSD        42.2M values/sec
  Random, memory         36.7M values/sec
  Sequential, memory    358.2M values/sec

(Disk tests were carried out on a freshly booted machine, a Windows 2003 server with 64 GB RAM and eight 15,000 RPM SAS disks in RAID 5 configuration, to eliminate the effect of operating-system disk caching. The SSD test used a latest-generation Intel high-performance SATA SSD.)

A. Jacobs. The Pathologies of Big Data. Comm. of the ACM, 52(8), Aug. 2009.


Main Memory Access Cost


A simple strided-access experiment (measured with Calibrator v0.9e, Stefan.Manegold@cwi.nl, on a Pentium M 1700 with 32 kB L1 and 2 MB L2 cache) traverses an array with a varying stride over a varying memory range:

    int data[arr_size];

    for (int i = arr_size - 1; i >= 0; i -= stride)
        process (data[i]);

[Figure: latency per iteration (nanoseconds / CPU cycles) as a function of the memory range (1 kB to 256 MB) for strides of 4 to 256 bytes; latency jumps once the range exceeds a cache size]

  Memory access incurs a significant latency (209 CPU cycles here).
  (Multiple levels of) caches try to hide this latency.
  Latency (measured in CPU cycles) keeps increasing over time.
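A self-contained version of such a micro-benchmark might look as follows (a sketch; the array size, stride, and timing method are choices of this example, not of the Calibrator tool):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        /* memory range and stride, both given in number of ints */
        size_t arr_size = (argc > 1) ? strtoul(argv[1], NULL, 10) : 16 * 1024 * 1024;
        size_t stride   = (argc > 2) ? strtoul(argv[2], NULL, 10) : 16;

        int *data = calloc(arr_size, sizeof(int));
        if (data == NULL)
            return 1;

        volatile long sum = 0;             /* keep the loop from being optimized away */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (long i = (long) arr_size - 1; i >= 0; i -= (long) stride)
            sum += data[i];                /* stands in for process(data[i]) */

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns    = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        double iters = (double) arr_size / (double) stride;
        printf("%.1f ns per iteration\n", ns / iters);

        free(data);
        return 0;
    }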

Memory Access Cost


Various caches lead to the situation that RAM is not random-access in today's systems:

  multi-level data caches (Intel x86: two levels20, AMD: three levels),
  instruction caches,
  translation lookaside buffers (TLBs) to speed up virtual address translation.

Novel database systems (sometimes called main-memory databases) include algorithms that are optimized for in-memory processing. To keep matters simple, they assume that all data always resides in main memory.
20 The new i7 processor line has an L3 cache, too.



Optimizing for Cache Efficiency


To access main memory, CPU caches, in a sense, play the role that the buffer manager played to access the disk. Use the same tricks to make good use of the caches (see the sketch below):

  Data processing in blocks: choose the block size to match the cache size now.
  Sequential access: there is explicit hardware support for sequential scans; use prefetching if possible (e.g., the x86 prefetchnta assembly instruction).
  What the page size was in the buffer manager, the cache line size is in the CPU cache (e.g., 64 bytes).
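A small sketch of both tricks; the block size, the prefetch distance, and the use of the SSE intrinsic (which emits prefetchnta) are assumptions of this example:

    #include <stddef.h>
    #include <xmmintrin.h>                 /* _mm_prefetch, _MM_HINT_NTA (x86 SSE)   */

    #define CACHE_LINE 64                                  /* bytes per cache line    */
    #define BLOCK_INTS (64 * 1024 / sizeof(int))           /* roughly L1-sized blocks */
    #define LINE_INTS  (CACHE_LINE / sizeof(int))

    /* Process one cache-friendly block sequentially, prefetching ahead. */
    static long process_block(const int *block, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            size_t ahead = i + 4 * LINE_INTS;      /* a few cache lines ahead        */
            if (i % LINE_INTS == 0 && ahead < n)   /* once per line: hint the hardware */
                _mm_prefetch((const char *) &block[ahead], _MM_HINT_NTA);
            sum += block[i];
        }
        return sum;
    }

    /* Process the whole input block by block. */
    long process_all(const int *data, size_t n)
    {
        long sum = 0;
        for (size_t off = 0; off < n; off += BLOCK_INTS) {
            size_t len = (n - off < BLOCK_INTS) ? n - off : BLOCK_INTS;
            sum += process_block(data + off, len);
        }
        return sum;
    }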


In-Memory Hash Join


Straightforward clustering (scan the input relation and write each tuple into one of H different clusters) may cause problems:

  If H exceeds the number of TLB entries, clustering will thrash the TLB.
  If H exceeds the number of cache lines, cache thrashing occurs.

How could we avoid these problems?


Radix Clustering
Radix clustering performs the clustering in multiple passes, each pass looking at only a few bits of the hash value. In the example below, pass 1 (h1) partitions on the two most significant bits; pass 2 (h2) then refines each cluster on the remaining bit:

  pass 1: h1, 2 bits            pass 2: h2, 1 bit

  57 001         57 001         96 000
  17 001         17 001         57 001
  03 011         81 001         17 001
  47 111         96 000         81 001
  92 100         75 001         75 001
  81 001    =>   03 011    =>   66 010
  20 100         66 010         03 011
  06 110         92 100         92 100
  96 000         20 100         20 100
  37 101         37 101         37 101
  66 010         47 111         06 110
  75 001         06 110         47 111

h1 and h2 are the same hash function, but they look at different bits in the generated hash.
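A sketch of a single radix pass over (value, hash) pairs; the out-of-place counting implementation and the field names are illustrative, not the original authors' code:

    #include <stdlib.h>

    typedef struct { int value; unsigned hash; } tuple_t;

    /* One radix pass: partition `in` into `out` on `bits` bits of the hash,
     * starting at bit position `shift` (counted from the least significant bit). */
    static void radix_pass(const tuple_t *in, tuple_t *out, size_t n,
                           unsigned shift, unsigned bits)
    {
        size_t   nclusters = (size_t) 1 << bits;
        unsigned mask      = (unsigned) (nclusters - 1);
        size_t  *count     = calloc(nclusters, sizeof(size_t));
        size_t  *cursor    = calloc(nclusters, sizeof(size_t));

        if (count == NULL || cursor == NULL) {
            free(count); free(cursor);
            return;
        }

        /* 1. histogram: how many tuples fall into each cluster?   */
        for (size_t i = 0; i < n; i++)
            count[(in[i].hash >> shift) & mask]++;

        /* 2. prefix sum: starting offset of each cluster in `out` */
        for (size_t c = 1; c < nclusters; c++)
            cursor[c] = cursor[c - 1] + count[c - 1];

        /* 3. scatter: copy every tuple to its cluster              */
        for (size_t i = 0; i < n; i++)
            out[cursor[(in[i].hash >> shift) & mask]++] = in[i];

        free(count);
        free(cursor);
    }

In the example above, radix_pass(in, out, n, 1, 2) plays the role of h1, and a second call with shift 0 and one bit, applied to each cluster separately, plays the role of h2. Using only a few bits per pass keeps the number of simultaneously written clusters below the number of TLB entries and cache lines.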

Radix Clustering

[Figure: execution time breakdown (seconds and CPU clocks, split into TLB misses, L1 data misses, L2 data misses, and remaining CPU cost) of one-pass clustering, multi-pass clustering, and optimized multi-pass clustering as a function of the number of radix bits, measured on an SGI Origin2000 (250 MHz, 32 kB L1, 4 MB L2), a Sun Ultra, and an Intel PC, at a cardinality of 8M tuples. Vertical grid lines indicate where the number of clusters created equals the number of TLB entries, L1 cache lines, or L2 cache lines, respectively.]

S. Manegold, P. Boncz, and M. Kersten. Optimizing Main-Memory Join on Modern Hardware. IEEE TKDE, vol. 14(4), Jul/Aug 2002.

Optimizing Instruction Cache Usage


Consider a query processor that uses tuple-wise pipelining: each tuple is passed through the pipeline (operators A, B, and C) before we process the next one. For eight tuples we obtain an execution trace ABCABCABCABCABCABCABCABC, where A, B, and C correspond to the code that implements the three operators.

Depending on the size of the code that implements A, B, and C, this can mean instruction cache thrashing.


Optimizing Instruction Cache Usage


We can improve the effect of instruction caching if we do the pipelining in larger chunks, e.g., four tuples at a time: AAAABBBBCCCCAAAABBBBCCCC. Three out of four executions of every operator will now find their instructions cached21 (see the sketch below). MonetDB again pushes this idea to the extreme: full tables are processed at once (full materialization).

 What do you think about this approach?


21 This assumes that A, B, and C fit into the instruction cache individually. A variation is to group operators, such that the code for each group fits into cache.
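A minimal sketch of chunk-wise ("vectorized") execution; the toy operators and the chunk size of four tuples are illustrative only:

    #include <stddef.h>

    #define CHUNK 4   /* tuples passed through each operator per call (illustrative) */

    /* Three toy operators A, B, and C; each processes a whole chunk per call,
     * so its instructions stay hot in the instruction cache while it runs.    */
    static void op_A(int *t, size_t n) { for (size_t i = 0; i < n; i++) t[i] += 1; }
    static void op_B(int *t, size_t n) { for (size_t i = 0; i < n; i++) t[i] *= 2; }
    static void op_C(int *t, size_t n) { for (size_t i = 0; i < n; i++) t[i] -= 3; }

    /* Chunk-wise pipelining: the trace becomes AAAABBBBCCCC AAAABBBBCCCC ... */
    void run_pipeline(int *tuples, size_t n)
    {
        for (size_t off = 0; off < n; off += CHUNK) {
            size_t len = (n - off < CHUNK) ? n - off : CHUNK;
            op_A(tuples + off, len);
            op_B(tuples + off, len);
            op_C(tuples + off, len);
        }
    }

Setting CHUNK to the full table size corresponds to MonetDB's full materialization.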

New Classes of Hardware for Data Processing


Our group is actively working on the use of non-traditional hardware for database processing.

Field-Programmable Gate Arrays (FPGAs): programmable hardware; implement database queries directly in hardware.
http://www.systems.ethz.ch/research/projects/avalanche

Remote Direct Memory Access (RDMA): hardware-accelerated network processing; used for distributed database systems.
http://www.systems.ethz.ch/research/projects/data-cyclotron


Field-Programmable Gate Arrays


  An array of logic gates
  Functionality fully programmable
  Re-programmable after deployment ("in the field"): programmable hardware

FPGAs can be configured to implement any logic circuit. Complexity is bound by the available chip space.


From Query to Hardware Circuit


Glacier compiles queries into hardware circuits. For example, the query

  SELECT Price, Volume
  FROM   Trades
  WHERE  Symbol = "UBSN" AND Volume > 100000

is translated into a circuit over the Trades input stream: tuples arrive as parallel wires (the payload) plus a data valid flag; one comparator (=) checks Symbol = "UBSN", another (<) checks 100000 < Volume, an AND gate (&) combines the two predicates, and qualifying tuples are projected onto Price and Volume.

[Figure: the generated circuit, built from registers and logic gates]


Use Case: Algorithmic Trading


Real-world use case: algorithmic trading. The challenge: sustain high packet rates, yet achieve low latency (microsecond range).

[Figure: fraction of packets processed at data input rates of 300,000 and 1,000,000 pkts/s, FPGA vs. software (Linux 2.6); the FPGA sustains 100 % at both rates, while the software solution processes only about 60 % and 36 %, respectively]

  Process data at wire speed.
  Shield the CPU from high load (about 90 % of the data is filtered out).

Join Processing in Data Cyclotron


[Figure: six hosts H0 through H5 arranged in a ring; each host Hi holds a partition Si of one join relation, while the chunks R0 through R5 of the other relation rotate around the ring from host to host via RDMA]

RDMA: join and rotate in parallel.
