MMDBMS, Fall 2009
Motivation
The techniques we've seen so far all built on the same assumptions:
- Query processing cost is dominated by disk I/O.
- Main memory is random-access memory.
- Access to main memory has negligible cost.
Are these assumptions justified at all?
Let's have a look at a real, large-scale database. Amadeus IT Group is a major provider of travel-related IT. Its core database, the Global Distribution System (GDS), holds dozens of millions of flight bookings at a few kilobytes per booking, i.e., several hundred gigabytes of data. These numbers may sound impressive, but the hot set of this database is significantly smaller: flights with nearby departure times are the most interesting ones. My laptop already has four gigabytes of RAM, so it is perfectly realistic to keep the hot set in main memory.
Row-Wise Storage
[Figure: records stored row-wise, filling data pages one record after another (page 0, page 1, ...)]
Records in the Amadeus ITINERARY table are 350 bytes wide, spanning 47 attributes (i.e., 10–30 records per page).
To answer a query like

  SELECT *
  FROM ITINERARY
  WHERE FLIGHTNO = 'LX7' AND CLASS = 'M'

the system has to scan the entire ITINERARY table. The table probably won't fit into main memory as a whole. And though we always have to fetch the full table from disk, we will only inspect 20–60 data items per page (to decide the predicate).
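To make the waste concrete, here is a minimal C sketch of such a scan (field names and sizes are illustrative assumptions, not the actual Amadeus schema): every 350-byte record streams through memory, yet only two of its fields are ever inspected.

  #include <stdio.h>
  #include <string.h>

  /* Illustrative record layout: 2 of the 47 attributes are named,
   * the rest is opaque payload (sizes chosen to total 350 bytes). */
  struct itinerary {
      char flightno[8];
      char class_;        /* booking class */
      char payload[341];  /* the remaining 45 attributes */
  };

  /* Row-wise scan: the whole table is fetched page by page, but the
   * predicate touches only flightno and class_ in each record. */
  static void scan(const struct itinerary *table, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          if (strcmp(table[i].flightno, "LX7") == 0 && table[i].class_ == 'M')
              printf("match at row %zu\n", i);
  }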
Column-Wise Storage
Compare this to a column-wise storage:

[Figure: tuples (a_i, b_i, c_i, d_i) decomposed vertically; page 0 holds the a column, page 1 the b column, and so on.]
We now have to evaluate the query in two steps, as sketched below:
1. Scan the pages that contain the FLIGHTNO and CLASS attributes.
2. For each matching tuple, fetch the 45 missing attributes from the remaining data pages.
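A minimal C sketch of this two-step plan (columns as plain arrays; the names and the integer dictionary code for FLIGHTNO are assumptions for illustration):

  #include <stdio.h>

  #define N 1000000

  static int  flightno[N];  /* each column is a separate dense array */
  static char class_[N];
  static int  price[N];     /* stands in for the 45 remaining attributes */

  static void query(void)
  {
      static size_t match[N];
      size_t m = 0;

      /* step 1: scan only the two columns the predicate needs */
      for (size_t i = 0; i < N; i++)
          if (flightno[i] == 742 /* dictionary code for "LX7" */
                  && class_[i] == 'M')
              match[m++] = i;

      /* step 2: positional fetches into the other columns */
      for (size_t j = 0; j < m; j++)
          printf("price = %d\n", price[match[j]]);
  }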
We read only a subset of the table, which may now fit into memory. We actually use hundreds or thousands of data items per page. But: we have to re-construct each tuple from 45 different pages (a positional fetch per missing attribute, instead of one sequential scan).

Column-wise storage particularly pays off if
- tables are wide (i.e., contain many columns),
- there is no index support (in high-dimensional spaces, e.g., indexes become ineffective; see Chapter III), and
- queries have a high selectivity.

OLAP workloads are the prototypical use case.
Example: MonetDB
The open-source database MonetDB (http://www.monetdb.org/) pushes the idea of vertical decomposition to its extreme: all tables (binary association tables, BATs) have exactly 2 columns.
  Relation:                    Decomposed into BATs:

  OID  ID    NAME   SEX        ID           NAME          SEX
  0    4711  John   M          OID ID       OID NAME      OID SEX
  1    1723  Marc   M          0   4711     0   John      0   M
  2    6381  Betty  F          1   1723     1   Marc      1   M
                               2   6381     2   Betty     2   F
Columns that carry consecutive numbers (such as OID above) can be represented as virtual columns: they are only stored implicitly (as the tuple order). This reduces space consumption and allows positional lookups, as the sketch below shows.
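A sketch of the idea in C (a deliberately simplified BAT with an integer tail; the field names are mine, not MonetDB internals):

  #include <stddef.h>

  typedef unsigned int oid;

  typedef struct {
      oid     seqbase;  /* OID of the first tuple; the OID column itself
                           is virtual and never materialized */
      int    *tail;     /* only the tail column is stored */
      size_t  count;
  } BAT;

  /* Positional lookup: a dense, ordered OID maps straight to an
   * array index, so no search structure is needed. */
  static int bat_fetch(const BAT *b, oid o)
  {
      return b->tail[o - b->seqbase];
  }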
With the help of column-wise storage, the hot set of the database may better fit into main memory. In addition, column-wise storage increases the effectiveness of compression: all values within a page belong to the same domain, so there is a high chance of redundancy in such pages. So, with all data in main memory, are we done already?
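Before answering: to see why same-domain pages compress well, here is a minimal run-length-encoding sketch in C (RLE is just one plausible codec for such pages, not a claim about any particular system):

  #include <stddef.h>

  typedef struct { int value; size_t length; } rle_run;

  /* Compress n column values into runs (out must hold up to n runs);
   * returns the number of runs. Low-cardinality columns, as typical
   * within a single-domain page, collapse into very few runs. */
  static size_t rle_compress(const int *col, size_t n, rle_run *out)
  {
      size_t runs = 0;
      for (size_t i = 0; i < n; ) {
          size_t j = i + 1;
          while (j < n && col[j] == col[i])
              j++;
          out[runs].value  = col[i];
          out[runs].length = j - i;
          runs++;
          i = j;
      }
      return runs;
  }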
[Figure (Jacobs, cited below): comparing random and sequential access in disk and memory.*]

  Random, disk:        316 values/sec
  Sequential, disk:    53.2M values/sec
  Random, SSD:         1,924 values/sec
  Sequential, SSD:     42.2M values/sec
  Random, memory:      36.7M values/sec
  Sequential, memory:

* Disk tests were carried out on a freshly booted machine (a Windows 2003 server with 64 GB RAM and eight 15,000 RPM SAS disks in RAID 5 configuration) to eliminate the effect of operating-system disk caching. The SSD test used a latest-generation Intel high-performance SATA SSD.

A. Jacobs. The Pathologies of Big Data. Comm. of the ACM, 52(8), Aug. 2009.
[Figure: memory-access latency (in CPU cycles) as a function of memory range (1 kB to 256 MB), for access strides from 4 to 256 bytes; measured with Calibrator v0.9e (Stefan Manegold, CWI).]
Memory access incurs a significant latency (209 CPU cycles here). (Multiple levels of) caches try to hide this latency, which keeps increasing over time relative to CPU speed.
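Such latencies are measured with chains of dependent loads; below is a minimal C sketch of the idea behind tools like the Calibrator (my simplification, not its actual code):

  #include <stdlib.h>

  /* Build one random cycle over n slots (Sattolo's algorithm), so that
   * neither the prefetcher nor out-of-order execution can run ahead. */
  static size_t *make_cycle(size_t n)
  {
      size_t *next = malloc(n * sizeof *next);
      for (size_t i = 0; i < n; i++)
          next[i] = i;
      for (size_t i = n - 1; i > 0; i--) {
          size_t j = (size_t)rand() % i;  /* j < i guarantees one cycle */
          size_t t = next[i]; next[i] = next[j]; next[j] = t;
      }
      return next;
  }

  /* Chase pointers: every load depends on the previous one, so the
   * loop time divided by `steps` approximates the memory latency. */
  static size_t chase(const size_t *next, size_t steps)
  {
      size_t pos = 0;
      while (steps--)
          pos = next[pos];
      return pos;  /* returning pos keeps the loop from being optimized away */
  }

Varying the array size reproduces the latency steps at the cache and TLB boundaries visible in the plot.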
If the number of clusters H exceeds the number of TLB entries, clustering will thrash the TLB: with, say, 64 TLB entries and H = 1024 output clusters, almost every scattered write goes to a page whose address translation is no longer cached.
Radix Clustering
[Figure: two-pass radix clustering of 12 values on 3-bit hash codes.

  input (value:hash):
    57:001 17:001 03:011 47:111 92:100 81:001 20:100 06:110 96:000 37:101 66:010 75:001

  pass 1 (h1, upper 2 bits, 4 clusters):
    57:001 17:001 81:001 96:000 75:001 | 03:011 66:010 | 92:100 20:100 37:101 | 47:111 06:110

  pass 2 (h2, remaining bit, within each cluster):
    96:000 57:001 17:001 81:001 75:001 | 66:010 03:011 | 92:100 20:100 37:101 | 06:110 47:111]
h1 and h2 are the same hash function, but they look at different bits in the generated hash.
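A minimal C sketch of one such pass (my simplification of the algorithm in the paper cited below, reusing its hash function and tuple layout): partitioning on only a few bits per pass keeps the number of open output clusters, and with it the TLB pressure, low.

  #include <string.h>

  /* the hash function from the paper cited below */
  #define HASH(v) (((v) >> 7) ^ ((v) >> 13) ^ ((v) >> 21) ^ (v))

  typedef struct { int v1, v2; } tuple;   /* simplified binary tuple */

  /* One clustering pass: partition n tuples on `bits` hash bits
   * starting at bit position `shift`. */
  void radix_pass(const tuple *in, tuple *out, size_t n, int shift, int bits)
  {
      size_t H = (size_t)1 << bits, mask = H - 1;
      size_t hist[H], pos[H];

      memset(hist, 0, H * sizeof hist[0]);
      for (size_t i = 0; i < n; i++)              /* 1st scan: histogram */
          hist[(HASH((unsigned)in[i].v1) >> shift) & mask]++;

      for (size_t c = 0, start = 0; c < H; c++) { /* prefix sum: offsets */
          pos[c] = start;
          start += hist[c];
      }
      for (size_t i = 0; i < n; i++)              /* 2nd scan: scatter */
          out[pos[(HASH((unsigned)in[i].v1) >> shift) & mask]++] = in[i];
  }

Calling radix_pass(in, out, n, 1, 2) and then, per cluster, radix_pass(..., 0, 1) reproduces the two-pass example above.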
[Figures from the paper cited below (Figures 10, 11, 13): execution-time breakdowns of Radix-Cluster, plotted against the number of radix bits, for one-pass clustering and optimized multi-pass clustering (P = 1, 2, 3; cardinality = 8M tuples) on a Sun Ultra, an Intel PC, and an SGI Origin 2000 (250 MHz, 32 kB L1 cache, 4 MB L2 cache). The breakdowns separate pure CPU cost from L1/L2 data-cache misses, DCU misses, resource stalls, and (modeled) TLB misses; vertical grid lines mark where the number of clusters equals the number of TLB entries, L1 cache lines, or L2 cache lines. The experiments use simplified binary tuples (typedef struct { int v1, v2; }) and the hash function HASH(v) = ((v >> 7) XOR (v >> 13) XOR (v >> 21) XOR v).]

S. Manegold, P. Boncz, and M. Kersten. Optimizing Main-Memory Join on Modern Hardware. IEEE TKDE, vol. 14(4), Jul/Aug 2002.
Depending on the size of the code that implements operators A, B, and C, this can mean instruction-cache thrashing.
(This assumes that A, B, and C fit into the instruction cache individually. A variation is to group operators, such that the code for each group fits into cache.)
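MonetDB sidesteps the problem by executing column-at-a-time: each operator runs over all its input before the next one starts. A minimal C sketch of the contrast (a, b, c stand in for the operators A, B, C; declarations only):

  #include <stddef.h>

  int a(int x);   /* placeholder operators A, B, C */
  int b(int x);
  int c(int x);

  /* Tuple-at-a-time: for every tuple, the code of A, B, and C runs in
   * turn; if their combined footprint exceeds the instruction cache,
   * every call incurs instruction-cache misses. */
  void tuple_at_a_time(const int *in, int *out, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          out[i] = c(b(a(in[i])));
  }

  /* Operator-at-a-time: each operator processes all tuples before the
   * next starts, so only one operator's code must be cache-resident at
   * a time (at the price of materializing intermediate results). */
  void operator_at_a_time(const int *in, int *t1, int *t2, int *out, size_t n)
  {
      for (size_t i = 0; i < n; i++) t1[i] = a(in[i]);
      for (size_t i = 0; i < n; i++) t2[i] = b(t1[i]);
      for (size_t i = 0; i < n; i++) out[i] = c(t2[i]);
  }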
Remote Direct Memory Access (RDMA): hardware-accelerated network processing, useful for distributed database systems. See, e.g., the Data Cyclotron project:
http://www.systems.ethz.ch/research/projects/data-cyclotron
Example: Glacier

[Figure: Glacier compiles a streaming query over a Trades stream into a hardware circuit: a selection combining (&) a comparison (<) of Volume against the constant 100,000 with the equality Symbol = "UBSN", projecting the Price attribute. Tuples travel as payload over parallel wires.]
Process data at wire speed, whatever the data input rate. Shield the CPU from high load (≈ 90 % of the data filtered out).
[Figure: the Data Cyclotron ring. Hosts H0 to H5 are connected in a ring by RDMA links; partitions R0 to R5 and S0 to S5 of relations R and S rotate from host to host around the ring.]