Yannis Kotidis: AT&T Labs-Research

What is a Data Warehouse?

•  “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making
process.” (W. H. Inmon)
•  A data warehouse is used for On-Line Analytical Processing (OLAP):
a “class of tools that enables the user to gain insight into data through
interactive access to a wide variety of possible views of the information”
•  3 Billion market worldwide [1999 figure, olapreport.com]
–  Retail industries: user profiling, inventory management
–  Financial services: credit card analysis, fraud detection
–  Telecommunications: call analysis, fraud detection

Data Warehouse Initiatives
•  Organized around major subjects, such as customer, product,
sales
–  integrate multiple, heterogeneous data sources
–  exclude data that are not useful in the decision support process
•  Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing
–  emphasis is on complex, exploratory analysis not day-to-day
operations
•  Large time horizon for trend analysis (current and past data)
•  Non-Volatile store
–  physically separate store from the operational environment

Data Warehouse Architecture

•  Extract data from operational data sources
–  clean, transform
•  Bulk load/refresh
–  warehouse is offline
•  OLAP server provides a multidimensional view
•  Multidimensional OLAP (Essbase, Oracle Express)
•  Relational OLAP (Redbrick, Informix, Sybase, SQL Server)

Why do we need all that?
•  Operational databases are for On-Line Transaction Processing (OLTP)
–  automate day-to-day operations (purchasing, banking, etc.)
–  transactions access (and modify!) a few records at a time
–  database design is application oriented
–  metric: transactions/sec
•  A data warehouse is for On-Line Analytical Processing (OLAP)
–  complex queries that access millions of records
–  need historical data for trend analysis
–  long scans would interfere with normal operations
–  synchronizing data-intensive queries among physically separated
databases would be a nightmare!
–  metric: query response time

Examples of OLAP
•  Comparisons (this period vs. last period)
–  Show me the sales per region for this year and compare it to that of
the previous year to identify discrepancies
•  Multidimensional ratios (percent to total)
–  Show me the contribution to weekly profit made by all items sold in
the northeast stores between May 1 and May 7
•  Ranking and statistical profiles (top N/bottom N)
–  Show me sales, profit and average call volume per day for my 10
most profitable salespeople
•  Custom consolidation (market segments, ad hoc groups)
–  Show me an abbreviated income statement by quarter for the last
four quarters for my northeast region operations

Multidimensional Modeling
•  Example: compute total sales volume per product and store

[Figure: two-dimensional array of sales totals with the axes Store and Product; one cell is shown holding the value 800]

Dimensions and Hierarchies
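•  A cell in the cube may store values (measurements) relative to the
combination of the labeled dimensions

[Figure: cube with the dimensions PRODUCT, LOCATION, and TIME; the highlighted cell holds the sales of DVDs in NY in August]

DIMENSIONS (levels listed from coarsest to finest)
PRODUCT: category, product
LOCATION: region, country, state, city, store
TIME: year, quarter, month, day (week is shown next to month as an alternative level above day)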


Common OLAP Operations
•  Roll-up: move up the hierarchy
–  e.g. given total sales per city, we can roll up to get sales per state
•  Drill-down: move down the hierarchy
–  more fine-grained aggregation
–  lowest level can be the detail records (drill-through)

[Figure: dimension hierarchies for PRODUCT (category, product), LOCATION (region, country, state, city, store), and TIME (year, quarter, month, week, day)]

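The roll-up from city to state can be sketched in a few lines of Python. This is only an illustration: the city names, the `CITY_TO_STATE` mapping, and the totals are made-up values, and `roll_up` is a hypothetical helper rather than part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical mapping from a lower level (city) to its parent level (state).
CITY_TO_STATE = {"Albany": "NY", "Buffalo": "NY", "Newark": "NJ"}

# Hypothetical pre-aggregated input: total sales per city.
sales_per_city = {"Albany": 120, "Buffalo": 80, "Newark": 200}


def roll_up(totals, parent_of):
    """Aggregate totals keyed by a child level into totals keyed by the parent level."""
    rolled = defaultdict(int)
    for child, amount in totals.items():
        rolled[parent_of[child]] += amount
    return dict(rolled)


sales_per_state = roll_up(sales_per_city, CITY_TO_STATE)
```

Drill-down goes the other way and cannot be computed from `sales_per_state` alone; it needs the finer-grained totals (or the detail records) to be available.
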
Pivoting
•  Pivoting: aggregate on selected dimensions
–  usually 2 dims (cross-tabulation)
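
A cross-tabulation on two dimensions might look as follows in Python. The fact rows and the `cross_tab` helper are assumptions made for this sketch only.

```python
from collections import defaultdict

# Hypothetical fact rows: (store, product, amount).
facts = [(1, "DVD", 100), (1, "PC", 250), (2, "DVD", 75), (2, "DVD", 25)]


def cross_tab(rows):
    """Aggregate amounts on two dimensions and return {store: {product: total}}."""
    table = defaultdict(lambda: defaultdict(int))
    for store, product, amount in rows:
        table[store][product] += amount
    return {store: dict(cols) for store, cols in table.items()}


pivot = cross_tab(facts)
```

Each outer key becomes a row of the cross-tab and each inner key a column; cells with no sales are simply absent.
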

Slice and Dice Queries
•  Slice and Dice: select and project on one or more dimensions

[Figure: data cube with the dimensions customers and store, sliced on customer = “Smith”]

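As a rough sketch, the slice on customer = “Smith” is a selection followed by a projection. The rows and the `slice_and_project` helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical fact rows with three dimensions and one measure.
facts = [
    {"customer": "Smith", "store": 1, "product": "DVD", "amount": 100},
    {"customer": "Smith", "store": 2, "product": "PC", "amount": 250},
    {"customer": "Jones", "store": 1, "product": "DVD", "amount": 75},
]


def slice_and_project(rows, fixed, keep):
    """Select rows matching `fixed` (the slice), then keep only the `keep` columns."""
    selected = [r for r in rows if all(r[k] == v for k, v in fixed.items())]
    return [{k: r[k] for k in keep} for r in selected]


smith = slice_and_project(facts, {"customer": "Smith"}, ["store", "amount"])
```
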
Roadmap
•  What a data warehouse is and what it is for
•  What are the differences between OLTP and OLAP
•  Multi-dimensional data modeling
•  Data warehouse design
–  the star schema, bitmap indexes
•  The Data Cube operator
–  semantics and computation
•  Aggregate View Selection
•  Dynamic View Management
•  Other Issues

Data Warehouse Design
•  Most data warehouses adopt a star schema to represent the
multidimensional model
•  Each dimension is represented by a dimension-table
–  LOCATION(location_key,store,street_address,city,state,country,region)
–  dimension tables are not normalized
•  Transactions are described through a fact-table
–  each tuple consists of a pointer to each of the dimension-tables (foreign
key) and a list of measures (e.g. sales $$$)

Star Schema Example
TIME(time_key, day, day_of_the_week, month, quarter, year)
PRODUCT(product_key, product_name, category, brand, color, supplier_name)
LOCATION(location_key, store, street_address, city, state, country, region)
SALES(time_key, product_key, location_key, units_sold, amount)
–  SALES is the fact table; time_key, product_key, and location_key point to the dimension tables
–  measures: units_sold, amount

Advantages of Star Schema
•  Facts and dimensions are clearly depicted
–  dimension tables are relatively static, data is loaded (append
mostly) into fact table(s)
–  easy to comprehend (and write queries)

“Find total sales per product-category in our stores in Europe”

SELECT PRODUCT.category, SUM(SALES.amount)
FROM SALES, PRODUCT, LOCATION
WHERE SALES.product_key = PRODUCT.product_key
AND SALES.location_key = LOCATION.location_key
AND LOCATION.region = 'Europe'
GROUP BY PRODUCT.category

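To try the query, one option is an in-memory SQLite database driven from Python's standard `sqlite3` module. The sketch below keeps only the columns the query touches, and all rows are made-up sample data; the added `ORDER BY` just makes the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PRODUCT (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE LOCATION (location_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE SALES (product_key INTEGER, location_key INTEGER, amount REAL);
    -- Hypothetical sample rows.
    INSERT INTO PRODUCT VALUES (1, 'video'), (2, 'computers');
    INSERT INTO LOCATION VALUES (10, 'Europe'), (20, 'Asia');
    INSERT INTO SALES VALUES (1, 10, 100), (1, 10, 50), (2, 10, 300), (2, 20, 999);
""")

# The star join: the fact table is joined to each dimension it needs.
rows = conn.execute("""
    SELECT PRODUCT.category, SUM(SALES.amount)
    FROM SALES, PRODUCT, LOCATION
    WHERE SALES.product_key = PRODUCT.product_key
      AND SALES.location_key = LOCATION.location_key
      AND LOCATION.region = 'Europe'
    GROUP BY PRODUCT.category
    ORDER BY PRODUCT.category
""").fetchall()
conn.close()
```

The sale recorded for the Asian store is filtered out by the restriction on LOCATION.region before the amounts are grouped by category.
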
Star Schema Query Processing
[Figure: the star schema of the previous example, annotated with the selection region = 'Europe' on LOCATION and the projection on category from PRODUCT]

Indexing OLAP Data: Bitmap Index
•  Each value in the column has a bit vector:
–  The i-th bit is set if the i-th row of the base table has the value for the
indexed column
–  The length of the bit vector: # of records in the base table
•  Mainly intended for small cardinality domains
[Figure: LOCATION table with a bitmap index on region]

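A bitmap index can be imitated with Python integers used as bit vectors. This is a toy sketch under that assumption (real systems store compressed bitmaps on disk); the `region` column contents are invented.

```python
def build_bitmap_index(column):
    """Return {value: int}, where bit i of the int is set if row i holds that value."""
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def rows_of(bitmap):
    """Decode a bit vector back into the list of row positions it marks."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality column: region of each LOCATION row.
region = ["Europe", "Asia", "Europe", "America", "Asia"]
index = build_bitmap_index(region)

# Predicates on the indexed column become bitwise operations.
europe_or_asia = index["Europe"] | index["Asia"]
```

With one bit vector per distinct value, the index stays small only while the column has few distinct values, which is why the technique targets small cardinality domains.
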
Join-Index
•  Join index relates the values of the dimensions of a star schema to
rows in the fact table.
–  a join index on region maintains for each distinct region a list of
ROW-IDs of the tuples recording the sales in the region
•  Join indices can span multiple dimensions OR
–  can be implemented as bitmap-indexes (per dimension)
–  use bitwise operations for multiple joins

[Figure: join index between LOCATION and SALES; the entries for region = Africa, America, Asia, and Europe point to SALES rows, with R102, R117, R118, and R124 shown with bit 1]

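The per-dimension bitmap variant can be sketched as below. For brevity the hypothetical SALES rows carry the dimension values directly; in a star schema they would be reached through the foreign keys. `join_bitmaps` is an illustrative helper, not a real API.

```python
# Hypothetical SALES rows: (region of the store, category of the product, amount).
sales = [
    ("Europe", "video", 100),
    ("Asia", "video", 40),
    ("Europe", "computers", 300),
    ("Europe", "video", 50),
]


def join_bitmaps(rows, position):
    """Map each dimension value to a bit vector over the fact rows that join with it."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[position]] = bitmaps.get(row[position], 0) | (1 << i)
    return bitmaps


by_region = join_bitmaps(sales, 0)
by_category = join_bitmaps(sales, 1)

# A bitwise AND combines the two restrictions before any fact row is fetched.
hits = by_region["Europe"] & by_category["video"]
matching = [i for i in range(len(sales)) if (hits >> i) & 1]
total = sum(sales[i][2] for i in matching)
```
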
Problem Solved?
•  “Find total sales per product-category in our stores in Europe”
–  Join-index will prune ¾ of the data (uniform sales), but the
remaining ¼ is still large (several million transactions)
•  Index is unclustered
•  High level aggregations are expensive!!!!!
–  long scans to get the data
–  hashing or sorting necessary for group-bys

⇒ Long Query Response Times
⇒ Pre-computation is necessary

[Figure: LOCATION hierarchy: region, country, state, city, store]

Multiple Simultaneous Aggregates
Cross-Tabulation (products/store)

4 Group-bys here:
–  (store,product)
–  (store)
–  (product)
–  ()
Need to write 4 queries!!!

[Figure: cross-tabulation of sales by product and store, with sub-totals per store, sub-totals per product, and total sales]

The Data Cube Operator (Gray et al.)
•  All previous aggregates in a single query:

SELECT LOCATION.store, SALES.product_key, SUM(amount)
FROM SALES, LOCATION
WHERE SALES.location_key = LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store

Challenge: Optimize Aggregate Computation

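The semantics of the operator can be illustrated with a naive Python version that computes every group-by separately. It is deliberately unoptimized (one pass over the data per group-by), the rows are hypothetical, and the `ALL` marker mimics the placeholder used for aggregated-out dimensions.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"


def cube(rows, n_dims):
    """Sum the measure (last field) for every subset of the first n_dims fields."""
    result = defaultdict(int)
    for r in range(n_dims + 1):
        for kept in combinations(range(n_dims), r):
            for row in rows:
                # Dimensions that are not kept are replaced by the ALL marker.
                key = tuple(row[d] if d in kept else ALL for d in range(n_dims))
                result[key] += row[-1]
    return dict(result)


# Hypothetical (store, product_key, amount) rows.
rows = [(1, 1, 10), (1, 2, 20), (2, 1, 30)]
agg = cube(rows, 2)
```

For two dimensions this yields the four group-bys of the cross-tabulation: the detail cells, the sub-totals per store, the sub-totals per product, and the grand total.
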
Relational View of Data Cube
Store  Product_key  sum(amount)
1      1            454
1      4            925
2      1            468
2      2            800
3      1            296
3      3            240
4      1            652
4      3            540
4      4            745
1      ALL          1379
2      ALL          1268
3      ALL          536
4      ALL          1937
ALL    1            1870
ALL    2            800
ALL    3            780
ALL    4            1670
ALL    ALL          5120

SELECT LOCATION.store, SALES.product_key, SUM(amount)
FROM SALES, LOCATION
WHERE SALES.location_key = LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store

Data Cube: Multidimensional View
[Figure: data cube with the axes Quarter (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (DVD, PC, VCR, sum), and Region (America, Europe, Asia, sum); the highlighted cell is the total annual sales of DVDs in America]

Other Extensions to SQL
•  Complex aggregation at multiple granularities (Ross et al. 1998)
–  Compute multiple dependent aggregates

SELECT LOCATION.store, SALES.product_key, SUM(amount)
FROM SALES, LOCATION
WHERE SALES.location_key = LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store: R
SUCH THAT R.amount = max(amount)

•  Other proposals: the MD-join operator (Chatziantoniou et al. 1999)

Data Cube Computation
•  Model dependencies among the aggregates:

product,store,quarter   (most detailed “view”)
product,quarter   store,quarter   product,store
quarter   product   store
none

–  e.g. view (product,store) can be computed from view
(product,store,quarter) by summing-up all quarterly sales

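The dependency between views can be made concrete with a small Python sketch that derives a coarser view from a finer one. The contents of the base view are invented, and `derive` is a hypothetical helper.

```python
from collections import defaultdict


def derive(parent, parent_dims, child_dims):
    """Compute a coarser view from a finer one by summing out the dropped dimensions."""
    positions = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, total in parent.items():
        child[tuple(key[p] for p in positions)] += total
    return dict(child)


dims = ("product", "store", "quarter")
# Hypothetical contents of the most detailed view.
base = {("DVD", 1, "Q1"): 10, ("DVD", 1, "Q2"): 15, ("PC", 2, "Q1"): 40}

# Walk one path down the lattice: each view is computed from its parent.
product_store = derive(base, dims, ("product", "store"))
product_only = derive(product_store, ("product", "store"), ("product",))
none_view = derive(product_only, ("product",), ())
```

Because SUM is distributive, `product_only` computed from `product_store` equals what a direct aggregation of `base` would give, which is what lets one view stand in for another.
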
Computation Directives
•  Hash/sort based methods (Agarwal et al., VLDB’96)
1.  Smallest-parent
2.  Cache-results
3.  Amortize-scans
4.  Share-sorts
5.  Share-partitions

[Figure: the view lattice over product, store, and quarter, from (product,store,quarter) down to none]

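The first directive, smallest-parent, can be sketched as follows: each view is computed from whichever of its direct parents in the lattice is smallest. The row counts are hypothetical, and the sketch ignores the other four directives entirely.

```python
P, S, Q = "product", "store", "quarter"

# Hypothetical row counts for the views of the (product, store, quarter) lattice.
sizes = {
    frozenset({P, S, Q}): 6000,
    frozenset({P, Q}): 800,
    frozenset({S, Q}): 200,
    frozenset({P, S}): 3000,
    frozenset({Q}): 4,
    frozenset({P}): 100,
    frozenset({S}): 50,
    frozenset(): 1,
}


def smallest_parent(view, sizes):
    """Return the smallest view that has exactly one more dimension than `view`."""
    parents = [v for v in sizes if view < v and len(v) == len(view) + 1]
    return min(parents, key=sizes.get) if parents else None


plan = {view: smallest_parent(view, sizes) for view in sizes}
```

With these sizes, (store) is computed from (store,quarter) rather than from the much larger (product,store).
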
Alternative Array-based Approach
•  Model data as a sparse multidimensional array
–  partition array into chunks (a small sub-cube which fits in memory).
–  fast addressing based on (chunk_id, offset)
•  Compute aggregates in “multi-way” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces memory
access and storage cost.

What is the best traversing order to do multi-way aggregation?

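The (chunk_id, offset) addressing could work along these lines. The row-major numbering of chunks and of cells inside a chunk is an assumption of this sketch, as are the array and chunk shapes.

```python
def chunk_address(cell, dims, chunk):
    """Map cell coordinates to (chunk_id, offset) using row-major order at both levels."""
    chunks_per_dim = [-(-d // c) for d, c in zip(dims, chunk)]  # ceiling division
    chunk_id = 0
    offset = 0
    for x, c, n in zip(cell, chunk, chunks_per_dim):
        chunk_id = chunk_id * n + x // c  # which chunk along this dimension
        offset = offset * c + x % c       # position inside that chunk
    return chunk_id, offset


# Hypothetical 8 x 8 array split into 4 x 4 chunks.
address = chunk_address((5, 2), dims=(8, 8), chunk=(4, 4))
```

Both parts of the address are plain arithmetic on the coordinates, so no index lookup is needed to locate a cell.
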
Reality check: too many views!
•  2^n views for n dimensions (no hierarchies)
•  Storage/update-time explosion
•  More pre-computation doesn’t mean better performance!!!!

How to choose among the views?
•  Use some notion of benefit per view
•  Limit: disk space or maintenance-time

Harinarayan et al., SIGMOD’96:
–  Pick views greedily until space is filled
–  Catch: quadratic in the number of views, which is itself exponential in the number of dimensions!!!

[Figure: the view lattice over product, store, and quarter, from (product,store,quarter) down to none]

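A simplified greedy selection in the spirit of this approach is sketched below. It is not the published algorithm: the cost model (answering a view costs one scan of its smallest materialized ancestor), the benefit-per-unit-of-space rule, and all view sizes are assumptions made for the illustration.

```python
def greedy_select(sizes, top, budget):
    """Greedily add the view with the best benefit per unit of space until none fits."""
    selected = {top}                       # the most detailed view is always stored
    cost = {v: sizes[top] for v in sizes}  # rows scanned to answer each view today
    used = 0
    while True:
        best, best_score = None, 0.0
        for v in sizes:
            if v in selected or used + sizes[v] > budget:
                continue
            # Every view derivable from v gets cheaper if v is materialized.
            benefit = sum(max(0, cost[w] - sizes[v]) for w in sizes if w <= v)
            if benefit / sizes[v] > best_score:
                best, best_score = v, benefit / sizes[v]
        if best is None:
            return selected
        selected.add(best)
        used += sizes[best]
        for w in sizes:
            if w <= best:
                cost[w] = min(cost[w], sizes[best])


P, S, Q = "product", "store", "quarter"
# Hypothetical row counts per view; the budget excludes the top view.
sizes = {
    frozenset({P, S, Q}): 6000, frozenset({P, Q}): 800, frozenset({S, Q}): 200,
    frozenset({P, S}): 3000, frozenset({Q}): 4, frozenset({P}): 100,
    frozenset({S}): 50, frozenset(): 1,
}
chosen = greedy_select(sizes, top=frozenset({P, S, Q}), budget=1000)
```

Each round examines every candidate view against every view it can answer, which is where the quadratic cost in the number of views comes from.
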
•  Selection is based on a workload estimate (e.g. logs) and a given
constraint (disk space or update window)
•  NP-hard, optimal selection cannot be computed for more than 4-5 dimensions
–  greedy algorithms (e.g. [Harinarayan96]) run at least in polynomial time in
the number of views, i.e. exponential in the number of dimensions!!!
•  Optimal selection cannot be approximated [Karloff99]
–  greedy view selection can behave arbitrarily badly
•  Lack of good models for a cost-based optimization!

Problem Generalization
•  View Management Problem: Materialize and maintain the right
subset of views with respect to the workload and the available
resources
•  What is the workload?
–  “Farmers” vs. “Explorers” [Inmon99]
–  Pre-compiled queries (report generating tools, data mining)
–  Ad-hoc analysis (unpredictable)
•  What are the resources?
–  Disk space (getting cheaper)
–  Update window (getting smaller)

DynaMat: A Dynamic View Management System
•  Continuous management based on disk space and update
window restrictions
•  Engage views whenever possible for incoming queries
–  e.g. infer monthly sales out of pre-computed daily sales
–  support both ad-hoc and pre-compiled queries
•  Exploit dependencies among the views to maintain the best
subset of them within the given update window

System Overview

•  Utilize a dedicated disk space (View Pool) for results of past queries
•  Engage stored results for answering new queries
–  Amortize query execution cost through multiple uses of the result

[Figure: system architecture with the User, Query Interface, Aggregate Locator, Admission Control, View Pool, and DW base tables]

The Space & Time Bounds

•  Pool utilization increases between updates
•  Space bound: new results compete with stored aggregates for the limited space
•  Time bound: results are evicted from the pool due to the update-time window

[Figure: pool utilization between updates, with the space bound and the time bound marked]

Dynamic View Management
•  Space and time restrictions will lead us to evict materialized aggregates
•  Not a traditional caching problem
–  aggregates don’t have the same size, cost, cost/size
–  aggregates are not independent
–  costs are dynamic

goodness(f) = (accesses(f) / (t - t_last_access)) * (cost(f) / size(f))
–  accesses(f): number of accesses
–  t - t_last_access: staleness
–  cost(f): re-computation cost
–  size(f): size in pages

[Figure: stored aggregates f1 and f2 over the dimensions store and customer, joined by a link]

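The goodness measure translates directly into an eviction rule: when space or time runs out, drop the stored aggregate with the lowest score. The pool entries below are invented, and the sketch assumes `now` is strictly later than every last access so that staleness is positive.

```python
def goodness(f, now):
    """Score a stored aggregate; a low score marks it as a candidate for eviction."""
    staleness = now - f["last_access"]  # assumed > 0
    return f["accesses"] / staleness * f["cost"] / f["size"]


def pick_victim(pool, now):
    """Return the name of the stored aggregate with the lowest goodness."""
    return min(pool, key=lambda name: goodness(pool[name], now))


# Hypothetical pool entries: size in pages, re-computation cost, access statistics.
pool = {
    "f1": {"accesses": 10, "last_access": 90, "cost": 500, "size": 20},
    "f2": {"accesses": 2, "last_access": 40, "cost": 800, "size": 100},
}
victim = pick_victim(pool, now=100)
```

Here f2 is evicted first: it is larger, rarely used, and stale, even though it is more expensive to re-compute than f1.
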
Exploiting Dependencies For Updates
•  For each stored aggregate compute minimum update cost UC(f)
–  incrementally from deltas
–  re-computation from father
•  shared maintenance cost
•  Total Update Cost:

[Figure: aggregate f over store and customer, updated either incrementally from the deltas or by re-computation from updated results]

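The choice between the two update paths can be sketched as a minimum over the available options. The cost model is an assumption made here for illustration (re-computation is charged one scan of an already updated father), and all sizes and incremental costs are hypothetical.

```python
def update_cost(f, incremental_cost, size, fathers):
    """UC(f): the cheaper of applying the deltas or re-computing from an updated father."""
    options = [incremental_cost[f]]
    # Assumed model: re-computation costs one scan of the father's updated result.
    options += [size[p] for p in fathers.get(f, [])]
    return min(options)


# Hypothetical sizes (in pages) and incremental update costs.
size = {"store,customer": 100, "store": 50, "customer": 40}
incremental_cost = {"store,customer": 300, "store": 120, "customer": 90}
fathers = {"store": ["store,customer"], "customer": ["store,customer"]}

uc = {f: update_cost(f, incremental_cost, size, fathers) for f in size}
```

In this example (store) is cheaper to re-compute from its father, while (customer) is cheaper to update from the deltas.
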
Other Issues
•  Fact+Dimension tables in the DW are views of tables stored in
the sources
•  Lots of view maintenance problems
–  correctly reflect asynchronous changes at the sources
–  making views self-maintainable
•  Interactive queries (on-line aggregation)
–  e.g. show running-estimates + confidence intervals
•  Computing Iceberg queries efficiently
•  Approximation
–  rough estimates for high-level aggregates are often good enough
–  histogram, wavelet, sampling based techniques (e.g. AQUA)
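
Running estimates with confidence intervals, as in on-line aggregation, can be illustrated with a streaming mean. The sketch assumes the rows arrive in random order, uses a normal approximation for the interval (z = 1.96 for roughly 95%), and runs on synthetic data; it is not the method of any particular system.

```python
import math
import random


def running_estimates(values, report_every, z=1.96):
    """Yield (rows_seen, mean, half_width) as rows are consumed in random order."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford's running sum of squared deviations
        if n % report_every == 0 and n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err


random.seed(0)
amounts = [random.uniform(0, 100) for _ in range(10_000)]  # hypothetical measure
random.shuffle(amounts)
reports = list(running_estimates(amounts, report_every=2_000))
```

Each report refines the previous one, so the user can stop the query as soon as the interval is narrow enough.
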

The End

• Thank you!
