
Cube Computation and Indexes for Data Warehouses


CPS 196.03
Notes 7

Processing
ROLAP servers vs. MOLAP servers
Index Structures
Cube computation
What to Materialize?
Algorithms

[Figure: warehouse architecture, top to bottom: Clients; Query & Analysis; Warehouse (with Metadata); Integration; Sources]

ROLAP Server

Relational OLAP Server

sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

[Figure: tools and utilities sit on a ROLAP server, which sits on a relational DBMS]

Special indices, tuning
Schema is denormalized
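A ROLAP server answers aggregate queries with ordinary SQL over relational tables. The sketch below is only an illustration, not the server's actual implementation: it loads the rows of the sale relation shown on the later Join slide into an in-memory SQLite database (the choice of SQLite and the function name are assumptions) and reproduces the (prodId, date, sum) table above with a GROUP BY.

```python
import sqlite3

# Rows of the sale relation (prodId, storeId, date, amt), copied from the Join slide.
SALE_ROWS = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def sales_by_product_and_date(rows):
    """Aggregate amt per (prodId, date) with plain SQL, as a ROLAP server might."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
    )
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT prodId, date, SUM(amt) FROM sale "
        "GROUP BY prodId, date ORDER BY date, prodId"
    ).fetchall()
    conn.close()
    return result


rolap_table = sales_by_product_and_date(SALE_ROWS)
print(rolap_table)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
```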

MOLAP Server
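Multi-Dimensional OLAP Server

[Figure: M.D. tools and utilities sit on a multidimensional server, which could also sit on a relational DBMS]

[Figure: Sales cube with axes Product (milk, soda, eggs, soap), City (A, B), and Date (tick labels 2, 3, 4 legible)]

MOLAP

[Figure: sales cube with axes Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum); one cell is labeled "Total annual sales of TV in U.S.A."]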

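MOLAP

[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64. Numbering runs along A first, then B, then C: the b3 row holds chunks 13 to 16 at c0, 29 to 32 at c1, 45 to 48 at c2, and 61 to 64 at c3.]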

Challenges in MOLAP

Storing large arrays for efficient access
    Row-major, column-major
    Chunking
    Compressing sparse arrays
Creating array data from data in tables
Efficient techniques for cube computation
Topics are discussed in the paper for reading
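To make chunking concrete, here is a minimal sketch that reproduces the chunk numbering of the 4 x 4 x 4 array in the MOLAP figure (A varies fastest, then B, then C). The second function shows one possible way to compress a sparse chunk by keeping only (offset, value) pairs; treating 0 as "empty" and the function names are assumptions made for illustration, not details taken from the slides.

```python
CHUNKS_PER_DIM = 4  # a0..a3, b0..b3, c0..c3 in the figure


def chunk_number(a, b, c, n=CHUNKS_PER_DIM):
    """1-based chunk number with A varying fastest, then B, then C."""
    return 1 + a + n * b + n * n * c


def compress_chunk(dense_cells):
    """Keep only the non-empty cells of a chunk as (offset, value) pairs.

    Assumption: a cell holding 0 is treated as empty.
    """
    return [(offset, value) for offset, value in enumerate(dense_cells) if value != 0]


print(chunk_number(0, 3, 1))            # 29, the a0/b3/c1 chunk in the figure
print(chunk_number(3, 0, 3))            # 52, the a3/b0/c3 chunk in the figure
print(compress_chunk([0, 0, 7, 0, 3]))  # [(2, 7), (4, 3)]
```

Row-major and column-major layouts are the same kind of offset formula applied to individual cells instead of chunks; only the order in which the dimensions are multiplied out changes.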

Index Structures

Traditional Access Methods
    B-trees, hash tables, R-trees, grids, ...
Popular in Warehouses
    inverted lists
    bit map indexes
    join indexes
    text indexes

Inverted Lists
age index: 18, 19, 20, 21, 22, 23, 25, 26 (each entry points to an inverted list)

inverted lists
20: r4, r18, r34, r35
21: r5, r19, r37, r40

data records
rId   name   age
r4    joe    20
r18   fred   20
r19   sally  21
r34   nancy  20
r35   tom    20
r36   pat    25
r5    dave   21
r41   jeff   26
...

Using Inverted Lists

Query: Get people with age = 20 and name = fred
List for age = 20: r4, r18, r34, r35
List for name = fred: r18, r52
Answer is intersection: r18
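The intersection step can be sketched as a merge of two sorted lists. This is only an illustration: record ids are represented as plain integers (r18 becomes 18), and both lists are assumed to be sorted, as they are on the slide.

```python
def intersect_sorted(list_a, list_b):
    """Intersect two sorted inverted lists with a single merge pass."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result


age_20 = [4, 18, 34, 35]   # list for age = 20
name_fred = [18, 52]       # list for name = fred
print(intersect_sorted(age_20, name_fred))  # [18]
```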

Bit Maps

age index: 18, 19, 20, 21, 22, 23, 25, 26 (each entry points to a bit map)

bit maps
20: 1101100000
21: 0010001011

data records
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...

Bitmap Index

Index on a particular column
Each value in the column has a bit vector; bit operations are fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high-cardinality domains

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
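As a rough sketch of how such an index could be built, the function below creates one bit vector per distinct value of a column, with bit i set when row i holds that value. The Region column is taken from the base table on this slide; the function name and the string representation of the vectors are assumptions chosen for readability.

```python
def build_bitmap_index(column):
    """Return {value: bit string}, one bit per row of the base table."""
    index = {}
    for row_number, value in enumerate(column):
        # Create an all-zero vector the first time a value is seen.
        bits = index.setdefault(value, [0] * len(column))
        bits[row_number] = 1
    return {value: "".join(str(b) for b in bits) for value, bits in index.items()}


region = ["Asia", "Europe", "Asia", "America", "Europe"]  # rows C1..C5
region_index = build_bitmap_index(region)
print(region_index)
# {'Asia': '10100', 'Europe': '01001', 'America': '00010'}
```

The index needs one vector per distinct value, which is why it is a poor fit for high-cardinality columns.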

Using Bit Maps

Query: Get people with age = 20 and name = fred
List for age = 20: 1101100000
List for name = fred: 0100000001
Answer is intersection: 0100000000

Good if domain cardinality small
Bit vectors can be compressed
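The intersection here is a bitwise AND of the two vectors. A minimal sketch, assuming the vectors are given as equal-length bit strings (the helper names are illustrative only):

```python
def and_bitmaps(x, y):
    """Bitwise AND of two equal-length bit strings."""
    if len(x) != len(y):
        raise ValueError("bit vectors must have the same length")
    width = len(x)
    return format(int(x, 2) & int(y, 2), f"0{width}b")


def matching_records(bits):
    """1-based record ids whose bit is set."""
    return [i + 1 for i, bit in enumerate(bits) if bit == "1"]


answer = and_bitmaps("1101100000", "0100000001")
print(answer)                    # 0100000000
print(matching_records(answer))  # [2]
```

Record 2 is fred in the data records of the Bit Maps slide, which agrees with the inverted-list answer.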

Join
Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT WHERE ...
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
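The join that produces joinTb can be sketched as a hash join. The WHERE clause is elided on the slide; the sketch assumes the join condition is sale.prodId = product.id, which is consistent with the joinTb rows shown. The function name is hypothetical.

```python
# product relation keyed by id: id -> (name, price)
PRODUCT = {"p1": ("bolt", 10), "p2": ("nut", 5)}

# sale relation: (prodId, storeId, date, amt)
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def join_sale_product(sale, product):
    """Hash join on sale.prodId = product.id (assumed join condition)."""
    joined = []
    for prod_id, store_id, date, amt in sale:
        if prod_id in product:  # probe the hash table built on product.id
            name, price = product[prod_id]
            joined.append((prod_id, name, price, store_id, date, amt))
    return joined


join_tb = join_sale_product(SALE, PRODUCT)
for row in join_tb:
    print(row)
```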

Join Indexes
product (with join index)
id  name  price  jIndex
p1  bolt  10     r1, r3, r5, r6
p2  nut   5      r2, r4

sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4
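A join index precomputes, for each product, the ids of the matching sale records. The following sketch builds the jIndex column from the sale table on this slide; the function name and the dictionary representation are assumptions for illustration.

```python
# (rId, prodId) pairs from the sale table on this slide
SALE_RIDS = [
    ("r1", "p1"),
    ("r2", "p2"),
    ("r3", "p1"),
    ("r4", "p2"),
    ("r5", "p1"),
    ("r6", "p1"),
]


def build_join_index(sale_rids):
    """Map each product id to the list of sale record ids that reference it."""
    j_index = {}
    for rid, prod_id in sale_rids:
        j_index.setdefault(prod_id, []).append(rid)
    return j_index


j_index = build_join_index(SALE_RIDS)
print(j_index)  # {'p1': ['r1', 'r3', 'r5', 'r6'], 'p2': ['r2', 'r4']}
```

With the index in place, the sales of a product are found by following the stored record ids instead of scanning or joining the sale table.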

Cube Computation for Data Warehouses

Counting Exercise

How many cuboids are there in a cube?
    The full or nothing case
    When dimension hierarchies are present
What is the size of each cuboid?

Lattice of Cuboids
Lattice levels:
all
city | product | date
city, product | city, date | product, date
city, product, date

Example cuboids:

all: 129

city: c1 = 67, c2 = 12, c3 = 50

city, product
      c1   c2   c3
p1    56    4   50
p2    11    8

city, product, date
day 1    c1   c2   c3
p1       12        50
p2       11    8
day 2    c1   c2   c3
p1       44    4
p2
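The lattice can be reproduced with a deliberately naive sketch that aggregates the base data once per subset of dimensions. The rows are the sale tuples used throughout these notes; the function name is hypothetical, and rescanning the base data for every cuboid is chosen for clarity, not efficiency.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("city", "product", "date")

# Base cuboid: (city, product, date, amt)
SALE = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]


def compute_cube(rows, dims=DIMS):
    """Return {cuboid dimensions: {cell key: sum of amt}} for every cuboid."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            cells = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group)
                cells[key] += row[-1]  # last field is amt
            cube[tuple(dims[i] for i in group)] = dict(cells)
    return cube


cube = compute_cube(SALE)
print(len(cube))                # 8 cuboids for 3 dimensions
print(cube[()])                 # {(): 129}
print(cube[("city",)])          # {('c1',): 67, ('c3',): 50, ('c2',): 12}
print(cube[("city", "product")][("c1", "p1")])  # 56
```

A more efficient computation would derive each cuboid from an already computed finer cuboid rather than from the base data.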

Dimension Hierarchies
Hierarchy for the city dimension: all > state > city

cities
city  state
c1    CA
c2    NY

Dimension Hierarchies
Lattice of cuboids with the state level added (not all arcs shown...):
all
city | state | product | date
city, product | city, date | state, product | state, date | product, date
city, product, date | state, product, date

Efficient Data Cube Computation

Data cube can be viewed as a lattice of cuboids
    The bottom-most cuboid is the base cuboid
    The top-most cuboid (apex) contains only one cell
    How many cuboids are there in an n-dimensional cube where dimension i has $L_i$ levels?

    $$T = \prod_{i=1}^{n} (L_i + 1)$$

Materialization of data cube
    Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
    Selection of which cuboids to materialize
        Based on size, sharing, access frequency, etc.
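The counting formula can be checked against the two lattices drawn earlier. In this small sketch, `levels` lists the number of hierarchy levels per dimension, not counting the implicit "all" level; the function name is an assumption.

```python
from math import prod


def count_cuboids(levels):
    """Number of cuboids: product over dimensions of (L_i + 1)."""
    return prod(l + 1 for l in levels)


# city, product, date with one level each: the plain 2^3 lattice
print(count_cuboids([1, 1, 1]))  # 8

# city dimension with two levels (city, state); product and date with one each
print(count_cuboids([2, 1, 1]))  # 12
```

The second result matches the twelve cuboids listed in the lattice that includes the state level.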

Derived Data

Derived Warehouse Data
    indexes
    aggregates
    materialized views (next slide)
When to update derived data?
    Incremental vs. refresh
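The difference between the two update strategies can be sketched on a small aggregate (total amt per product). This is an illustration under simplifying assumptions: only insertions are handled, and the function names are hypothetical. `refresh` recomputes the derived data from scratch, while `apply_insert` adjusts the existing result using only the new row.

```python
from collections import defaultdict


def refresh(sale_rows):
    """Recompute the aggregate from all base rows."""
    totals = defaultdict(int)
    for prod_id, amt in sale_rows:
        totals[prod_id] += amt
    return dict(totals)


def apply_insert(view, new_row):
    """Incrementally maintain the aggregate for one inserted row."""
    prod_id, amt = new_row
    view[prod_id] = view.get(prod_id, 0) + amt
    return view


base = [("p1", 12), ("p2", 11), ("p1", 50)]
view = refresh(base)                      # {'p1': 62, 'p2': 11}

new_row = ("p2", 8)
incremental = apply_insert(dict(view), new_row)
recomputed = refresh(base + [new_row])
print(incremental == recomputed)          # True
print(incremental)                        # {'p1': 62, 'p2': 19}
```

Incremental maintenance touches only the affected cell, whereas a refresh rescans the base data; the trade-off grows with the size of the warehouse.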

Idea of Materialized Views

Define new warehouse tables/arrays

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4

Efficient OLAP Processing

Determine which operations should be performed on available cuboids
    Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for OLAP:
    Let the query to be processed be on {brand, province_or_state} with the condition year = 2004, and there are 4 materialized cuboids available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
    Which should be selected to process the query?
Explore indexing structures & compressed vs. dense arrays in MOLAP
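One way to approach the question is to first rule out cuboids that cannot answer the query at all. The sketch below assumes two concept hierarchies that the slide does not state explicitly (item_name rolls up to brand; city rolls up to province_or_state, which rolls up to country); these hierarchies and all function names are assumptions. A cuboid is usable if, for every attribute the query needs, it holds an attribute of the same hierarchy at the same or a finer level.

```python
# Assumed concept hierarchies, finest level first (not given on the slide).
HIERARCHIES = [
    ["item_name", "brand"],
    ["city", "province_or_state", "country"],
    ["year"],
]


def locate(attr):
    """Return (hierarchy, level index) for an attribute."""
    for hierarchy in HIERARCHIES:
        if attr in hierarchy:
            return hierarchy, hierarchy.index(attr)
    raise KeyError(attr)


def can_answer(cuboid, query):
    """True if the query can be computed from the materialized cuboid."""
    c_attrs, c_filter = cuboid
    q_attrs, q_filter = query
    # A pre-filtered cuboid is usable only if the query asks for the same restriction.
    for attr, value in c_filter.items():
        if q_filter.get(attr) != value:
            return False
    # Still needed: group-by attributes plus filters the cuboid has not applied yet.
    needed = set(q_attrs) | {a for a in q_filter if a not in c_filter}
    for attr in needed:
        hierarchy, level = locate(attr)
        if not any(a in hierarchy and hierarchy.index(a) <= level for a in c_attrs):
            return False
    return True


QUERY = ({"brand", "province_or_state"}, {"year": 2004})
CUBOIDS = [
    ({"year", "item_name", "city"}, {}),
    ({"year", "brand", "country"}, {}),
    ({"year", "brand", "province_or_state"}, {}),
    ({"item_name", "province_or_state"}, {"year": 2004}),
]

usable = [i + 1 for i, c in enumerate(CUBOIDS) if can_answer(c, QUERY)]
print(usable)  # [1, 3, 4]
```

Under these assumptions cuboid 2 is ruled out because country is coarser than province_or_state. Choosing among the remaining candidates depends on cuboid sizes and available indexes, which this sketch does not model.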

What to Materialize?
Store in warehouse results useful for common queries
Example: total sales

Base data (city, product, date)
day 1    c1   c2   c3
p1       12        50
p2       11    8
day 2    c1   c2   c3
p1       44    4
p2

Materialize:

city, product
      c1   c2   c3
p1    56    4   50
p2    11    8

...

city: c1 = 67, c2 = 12, c3 = 50
product: p1 = 110, p2 = 19
all: 129

Materialization Factors
Type/frequency of queries
Query response time
Storage cost
Update cost

Will study a concrete algorithm later


Iceberg Cube

Computing only the cuboid cells whose count or other aggregate satisfies a condition such as
    HAVING COUNT(*) >= minsup
Motivation
    Only a small portion of cube cells may be "above the water" in a sparse cube
    Only calculate "interesting" cells: data above a certain threshold
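The iceberg condition can be illustrated with a naive sketch that counts every cell of every cuboid and then keeps only the cells with COUNT(*) >= minsup. The rows reuse the (city, product, date) values of the sale data from earlier slides; the function name is hypothetical. Computing everything and filtering afterwards only shows what the result looks like; an actual iceberg-cube algorithm would prune low-count cells during computation.

```python
from collections import Counter
from itertools import combinations

DIMS = ("city", "product", "date")
ROWS = [
    ("c1", "p1", 1),
    ("c1", "p2", 1),
    ("c3", "p1", 1),
    ("c2", "p2", 1),
    ("c1", "p1", 2),
    ("c2", "p1", 2),
]


def iceberg_cube(rows, dims, minsup):
    """Return only the cells whose row count is at least minsup."""
    result = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            counts = Counter(tuple(row[i] for i in group) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= minsup}
            if kept:  # drop cuboids with no cell above the threshold
                result[tuple(dims[i] for i in group)] = kept
    return result


iceberg = iceberg_cube(ROWS, DIMS, minsup=3)
for cuboid, cells in iceberg.items():
    print(cuboid, cells)
```

With minsup = 3, only four cells survive out of the whole cube: the apex cell, city = c1, product = p1, and date = 1.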
