Download as pdf or txt
Download as pdf or txt
You are on page 1of 57

Islamic Republic Of Afghanistan

Ministry Of Higher Education


Herat University
Computer Science Faculty
Advance Studies in
Database Systems 1 (Semester 7)

(Data Warehousing and Data Mining )


Lecture 5
MOLAP & ROLAP

LECTURER: HAMED AMIRY

hamed.amir.2015@gmail.com

DATAWAREHOUSING & DATAMINING-LECTURE 5 1


Outline of today’s class
 DW & OLAP
 OLAP facts and dimension
 OLAP FASMI test
 MOLAP Implementation
 MOLAP issues
 ROLAP Implementation
 ROLAP issues
 Federated Databases
 Logical database design

2 Datawarehousing & Datamining-Lecture 5


DW & OLAP
 Relationship between DWH & OLAP

 Data Warehouse & OLAP go together.

 Analysis supported by OLAP

3 Datawarehousing & Datamining-Lecture 5


Supporting the human thought process
THOUGHT PROCESS QUERY SEQUENCE

An enterprise wide fall in profit What was the quarterly sales during
last year ??


Profit down by a large percentage What was the quarterly sales at
consistently during last quarter only. regional level during last year ??
Rest is OK

What was the quarterly sales at


product level during last year?
What is special about last quarter ?

What was the monthly sale for last


quarter group by products
Products alone doing OK, but North
region is most problematic.
What was the monthly sale for last
quarter group by region
OK. So the problem is the high cost of
products purchased
in north. What was the monthly sale of
products in north at store level group
by products purchased
 How many such query sequences can be programmed in advance?
4 Datawarehousing & Datamining-Lecture 5
Analysis of last example
 Analysis is Ad-hoc
 Analysis is interactive (user driven)
 Analysis is iterative
 Answer to one question leads to a dozen more

 Analysis is directional
 Drill Down

 Roll Up More in
subsequent
slides
 Pivot

5 Datawarehousing & Datamining-Lecture 5


Challenges with MIS…
 Fails to remain user_driven (becomes programmer driven).
 Fails to remain ad_hoc and hence is not interactive.
 Enable ad-hoc query support in MIS systems
 Business user can not build his/her own queries (does not know SQL,
should not know it).
 Contradiction
 Want to compute answers in advance, but don't know the questions

 Solution
 Compute answers to “all” possible “queries”. But how?
 NOTE: Queries are multidimensional aggregates at some level

6 Datawarehousing & Datamining-Lecture 5


“All” possible queries (level aggregates)
ALL ALL

Province Kabul ... Herat

District Paghman... Bagrami Ghoryan ... Enjil

City Kamari Ghoryan

Zone Zone 1 ... Zone 2

7 Datawarehousing & Datamining-Lecture 5


OLAP: Facts & Dimensions
 The foundation for design in this environment is through
use of dimensional modeling techniques

 FACTS: Quantitative values (numbers) or “measures.”


 e.g., units sold, sales $, Co, Kg etc.

 DIMENSIONS: Descriptive categories.


 How we filter/report on the quantities e.g., time, geography,
product etc.
 DIM often organized in hierarchies representing levels of detail
in the data (e.g., week, month, quarter, year, decade etc.).

8 Datawarehousing & Datamining-Lecture 5


Where Does OLAP Fit In?
 It is a classification of applications, NOT a database design
technique.
 OLAP is a characterization of the application domain centered
around slice-and-dice and drill down analysis

 Analytical processing uses multi-level aggregates, instead


of record level access.

 Objective is to support very


 fast
 iterative and
 ad-hoc decision-making.

9 Datawarehousing & Datamining-Lecture 5


Where does OLAP fit in?

?

Transaction
Data
Data
Loading

OLAP

Reports

Decision
Maker
Data Cube
(MOLAP) Presentation
Tools

10 Datawarehousing & Datamining-Lecture 5


OLAP FASMI Test
 Fast: Delivers information to the user at a fairly constant rate.
Most queries answered in under five seconds.
 Analysis: Performs basic numerical and statistical analysis of
the data, pre-defined by an application developer or defined
ad-hocly by the user.
 Shared: Implements the security requirements necessary for
sharing potentially confidential data across a large user
population.
 Multi-dimensional: The essential characteristic of OLAP.
 Information: Accesses all the data and information necessary
and relevant for the application, wherever it may reside and
not limited by volume.
.
...from the OLAP Report by Pendse and Creeth

11 Datawarehousing & Datamining-Lecture 5


OLAP Implementations

 MOLAP: OLAP implemented with a multi-dimensional


data structure.

 ROLAP: OLAP implemented with a relational database.

12 Datawarehousing & Datamining-Lecture 5


MOLAP Implementations
 OLAP has historically been implemented using a
multi_dimensional data structure or “cube”.

 Dimensions are key business factors for analysis:


 Geographies (city, district, division, province,...)
 Products (item, product category, product department,...)
 Dates (day, week, month, quarter, year,...)
 Typically dimensions are the features on which the WHERE clause is
executed

 Very high performance achieved by O(1) time lookup into


“cube” data structure to retrieve pre_aggregated results.
 O(1) means that the execution time of the algorithm does not depend on the size of the input. Its execution time is
constant.
 Violates O(1) when the cube size is so large that it can not fit in the main memory

13 Datawarehousing & Datamining-Lecture 5


MOLAP Implementations
 No standard query language for querying MOLAP
 No SQL !
 there are no traditional relational structures

 Vendors provide proprietary languages allowing business users


to create queries that involve pivots, drilling down, or rolling
up.
 E.g. MDX of Microsoft

 Languages generally involve extensive visual (click and drag) support.

 Application Programming Interface (API)’s also provided for probing


the cubes.
 for running more complex quires

14 Datawarehousing & Datamining-Lecture 5


Aggregations in MOLAP
 Sales volume as a function of (i) product, (ii) time, and (iii)
geography

 A cube structure created to handle this.


S
Dimensions: Product, Geography, Time W
E
Hierarchical summarization paths N
Milk
Industry Province Year 23

Product
Bread 8
Category Division Quarter Eggs 45
Butter 13
Product District Month Week 12
Jam
Juice 10
City Day
w1 w2 w3 w4 w5 w6

15 Zone Datawarehousing & Datamining-Lecture 5


Time
Cube operations
 Drill down: get more details
 e.g., given summarized sales as above, find breakup of sales by
city within each region

 Rollup: summarize data


 e.g., given sales data, summarize sales for last year by product
category and region

 Slice and dice: select and project


 e.g.: Sales of soft-drinks in Herat during last quarter

 Pivot: change the view of data

16 Datawarehousing & Datamining-Lecture 5


Slicing and Dicing
is a combination of looking at a subset of data based on
more than one dimension

Red

Red
Blue
Blue WA WA
OR OR
Gray Gray
CA CA
Jul Aug Sep Jul Aug Sep

WA
Blue
Total OR
Blue
17
Jul Aug Sep Datawarehousing & Datamining-Lecture 5
CA
Jul Aug Sep
Querying the Data Cube
Number of Autos Sold
 Cross-tabulation
 “Cross-tab” for short CA OR WA Total
 Report data grouped by 2 Jul 45 33 30 108
dimensions
Aug 50 36 42 128
 Aggregate across other
dimensions Sep 38 31 40 109
 Include subtotals Total 133 100 112 345
 Operations on a cross-tab
 Roll up (further aggregation)
 Drill down (less aggregation)

18 Datawarehousing & Datamining-Lecture 5


Roll Up and Drill Down
Number of Autos Sold
Number of Autos Sold
CA OR WA Total
CA OR WA Total
Jul 45 33 30 108
133 100 112 345
Aug 50 36 42 128 Roll up
Sep 38 31 40 109 by Month Drill down
by Color
Total 133 100 112 345
Number of Autos Sold

CA OR WA Total

Red 40 29 40 109
Blue 45 31 37 113
Gray 48 40 35 123
19 Total 133
Datawarehousing & Datamining-Lecture 5 100 112 345
“Standard” Data Cube Query
 Measurements
 Which fact(s) should be reported?
 Filters
 What slice(s) of the cube should be used?
 Grouping attributes
 How finely should the cube be diced?
 Each dimension is either:
 (a) A grouping attribute
 (b) Aggregated over (“Rolled up” into a single total)
 n dimensions → 2n sets of grouping attributes
 Aggregation = projection to a lower-dimensional subspace

20 Datawarehousing & Datamining-Lecture 5


Full Data Cube with Subtotals
 Pre-computation of aggregates → fast answers to
OLAP queries
 Ideally, pre-compute all 2n types of subtotals
 Otherwise, perform aggregation as needed
 Coarser-grained totals can be computed from finer-
grained totals
 But not the other way around
 Daily totals can compute weekly totals but not other way
around

21 Datawarehousing & Datamining-Lecture 5


Data Cube Lattice

State, Month,
Color

State, State, Month,


Month Color Color
Drill Roll
Down Up
State Month Color

Total
22 Datawarehousing & Datamining-Lecture 5
MOLAP evaluation
 Advantages of MOLAP:
 Instant response (pre-calculated aggregates).
 Impossible to ask question without an answer.

 Drawbacks of MOLAP:
 Long load time ( pre-calculating the cube may take days!).
 "curse of dimensionality" i.e. as the number of dimensions increases,
the number of possible aggregates increases exponentially

 Very sparse cube (null aggregates) lead to wastage of space


for high cardinality (dimensions).
e.g. number of warm jackets sold in summer.
 Requires special-purpose data store

23 Datawarehousing & Datamining-Lecture 5


Sparsity
 Imagine a data warehouse for Safeway.
 Suppose dimensions are: Customer, Product, Store, Day
 If there are 100,000 customers, 10,000 products, 1,000 stores,
and 1,000 days…
 …data cube has 1,000,000,000,000,000 cells!
 Fortunately, most cells are empty.
 A given store doesn’t sell every product on every day.
 A given customer has never visited most of the stores.
 A given customer has never purchased most products.
 Multi-dimensional arrays are not an efficient way to store
sparse data.

24 Datawarehousing & Datamining-Lecture 5


MOLAP Implementation issues
 Maintenance issue: Every new/modified data item received
must be aggregated(aggregates needs to run again) into every
cube (assuming “up to-date” summaries are maintained). Lot
of work.

 Storage issue: As dimensions get less detailed (e.g., year vs.


day) cubes get much smaller, but storage consequences for
building hundreds of cubes need to be considered
 as each possible combination of dimensions has to be pre - calculated
 so soon faced with a combinatorial explosion
 So Lot of space
 Scalability: Often have difficulty scaling when the size of
dimensions becomes large
 but limited solutions addressing the scalability problem i.e. Virtual cubes
and partitioned cubes

25 Datawarehousing & Datamining-Lecture 5


Partitioned Cubes
 To overcome the space limitation of MOLAP, the cube is
partitioned.

 The divide&conquer cube partitioning approach helps alleviate


the scalability limitations of MOLAP implementation.

 One logical cube of data can be spread across multiple physical


cubes on separate (or same) servers.

 Ideal cube partitioning is completely invisible to end users.

 Performance degradation does occurs in case of a join across


partitioned cubes.

26 Datawarehousing & Datamining-Lecture 5


Partitioned Cubes: How it looks Like?
Men’s clothing
Children clothing
Women’s clothing

Time

Product

Geography

 Sales data cube partitioned at a major cotton products sale outlet

27 Datawarehousing & Datamining-Lecture 5


Virtual Cubes
 Used to query two dissimilar cubes by creating a third
“virtual” cube by a join between two cubes.

 Logically similar to a relational view i.e. linking two (or


more) cubes along common dimension(s).

 Biggest advantage is saving in space by eliminating


storage of redundant information.

 Example: Usually the sale price of different items varies based on the geography
and the time of the year. Instead of storing this information using an additional
dimension, the said price is calculated at run time, this may be slow, but can result
in tremendous saving in space, as all the items are not sold throughout the year.

28 Datawarehousing & Datamining-Lecture 5


Relational OLAP (ROLAP)

29 Datawarehousing & Datamining-Lecture 5


Why ROLAP?
 Issue of scalability i.e. curse of dimensionality for MOLAP
 Deployment of significantly large dimension tables as compared to
MOLAP using secondary storage.
 Aggregate awareness allows using pre-built summary tables by some
front-end tools.
 smart enough to develop or compute higher level aggregates using lower
level or more detailed aggregates.
 Star schema designs usually used to facilitate ROLAP querying
 Advantages:
 Scales well to high dimensionality
 Scales well to large data sets
 Sparsity is not a problem
 Uses well-known, mature technology
 Disadvantages:
 Query performance is slower than MOLAP
 Need to construct explicit indexes

30 Datawarehousing & Datamining-Lecture 5


ROLAP as a “Cube”
 OLAP data is stored in a relational database (e.g. a star
schema)
 The fact table is a way of visualizing as a “un-rolled” cube.
 So where is the cube?
 It’s a matter of perception
 Visualize the fact table as an elementary cube.

Product
Fact Table

Month Product Zone Sale K Rs.


M1 P1 Z1 250
M2 P2 Z1 500
31 Datawarehousing & Datamining-Lecture 5
Time
How to create “Cube” in ROLAP
 Cube is a logical entity containing values of a certain fact
at a certain aggregation level at an intersection of a
combination of dimensions.
 The following table can be created using 3 queries
Month_ID
SUM M1 M2 M3 ALL
(Sales_Amt)
P1
Product_ID

Part 2 P
ar
P2 t
P3 1

Total Part 3

32 Datawarehousing & Datamining-Lecture 5


How to create “Cube” in ROLAP using SQL
 For the table entries, without the totals
 SELECT S.Month_Id, S.Product_Id,
SUM(S.Sales_Amt)
 FROM Sales
 GROUP BY S.Month_Id, S.Product_Id;

 For the row totals


 SELECT S.Product_Id, SUM (Sales_Amt)
 FROM Sales
 GROUP BY S.Product_Id;

 For the column totals


 SELECT S.Month_Id, SUM (Sales)
 FROM Sales
 GROUP BY S.Month_Id;

33 Datawarehousing & Datamining-Lecture 5


Problem With Simple Approach
 Number of required queries increases exponentially with
the increase in number of dimensions.
 Its wasteful to compute all queries.
 In the example, the first query can do most of the work of the
other two queries
 If we could save that result and aggregate over Month_Id and
Product_Id, we could compute the other queries more
efficiently

 Better Solution:
 CUBE & ROLLUP (sql operator)

34 Datawarehousing & Datamining-Lecture 5


CUBE & ROLLUP
 The CUBE clause is part of SQL:1999
 GROUP BY CUBE (v1, v2, …, vn)
 Equivalent to a collection of GROUP BYs, one for each of the
subsets of v1, v2, …, vn
 CUBE computes entire lattice
 ROLLUP computes one path through lattice
 Order of GROUP BY list matters
 Groups by all prefixes of the GROUP BY list
GROUP BY ROLLUP(A,B,C) GROUP BY CUBE(A,B,C)
•A,B,C •A,B,C
•(A,B) subtotals •Subtotals for the following:
•(A) subtotals (A,B), (A,C), (B,C),
•Total (A), (B), (C)
•Total
35 Datawarehousing & Datamining-Lecture 5
Cube example

SELECT S.Month_Id, S.Product_Id, SUM(S.Sales_Amt)


FROM Sales
GROUP BY CUBE(S.Month_Id, S.Product_Id);

36 Datawarehousing & Datamining-Lecture 5


ROLAP & Space Requirement
 OK so we worked smart and got around the problem of
aggregate generation. But the aggregates once generated
have to be stored somewhere too i.e. in tables
 If one is not careful, with the increase in number of
dimensions, the number of summary tables gets very
large
 Consider the example discussed earlier with the
following two dimensions on the fact table...
 Time: Day, Week, Month, Quarter, Year, All Days
 Product: Item, Sub-Category, Category, All Products

37 Datawarehousing & Datamining-Lecture 5


EXAMPLE: ROLAP & Space Requirement
A naive implementation will require all combinations of summary tables
at each and every aggregation level.

 

24 summary tables, add in geography, results
in 120 tables
- The largest aggregate will be the Summarization by day
and product because this is the most detailed
- Smart tools will allow less detailed aggregates to be constructed
from more detailed aggregates (full aggregate awareness)
38
Datawarehousing & Datamining-
ROLAP Issues

 Maintenance.
 Non standard hierarchy of dimensions.
 Non standard conventions.
 Explosion of storage space requirement.

39 Datawarehousing & Datamining-Lecture 5


ROLAP Issue: Maintenance
 Summary tables are mostly a maintenance issue (similar
to MOLAP) than a storage issue.

 Notice that summary tables get much smaller as


dimensions get less detailed (e.g., year vs. day).

 Should plan for twice the size of the raw data for ROLAP
summaries in most environments.

 Assuming "to-date" summaries, every (new or


updated)detail record that is received into warehouse
must aggregate into EVERY summary table.
40 Datawarehousing & Datamining-Lecture 5
ROLAP Issue: Hierarchies
 Dimensions are NOT always simple hierarchies

 Dimensions can be more than simple hierarchies i.e. item,


subcategory, category, etc.

 The product dimension might also branch off by trade style


that cross simple hierarchy boundaries such as:
 Looking at sales of air conditioners that cross manufacturer
boundaries, such as COY1, COY2, COY3 etc.
 Looking at sales of all “green colored” items that even cross product
categories (washing machine, refrigerator, split-AC, etc.).
 Looking at a combination of both.
 Will result in a combinatorial explosion

41 Datawarehousing & Datamining-Lecture 5


ROLAP Issue: Convention
 Conventions are NOT absolute
 Example: What is calendar year? What is a week?
 Calendar:
 01 Jan. to 31 Dec or
 01 Jul. to 30 Jun. or
 01 Sep to 30 Aug.
 Week:
 Mon. to Sat. or Sat. to Thu.

42 Datawarehousing & Datamining-Lecture 5


ROLAP Issue: Storage space explosion
 Summary tables required for non-standard grouping
 item with flag as compared to items without a flag because of
National Independence Day

 In some cases bringing all the departments on the same grid


for agreeing on the same definition of year or week may not
be advisable (convention issue)
 In that case summary tables required along different definitions of
year, week etc.

 Brute force approach would quickly overwhelm the system


storage capacity due to a combinatorial explosion.
 Brute force Path: creating all possible summary tables with
all possible definitions, groupings and nomenclatures

43 Datawarehousing & Datamining-Lecture 5


How to Reduce Summary tables?
 Many ROLAP products have developed means to reduce
the number of summary tables by:

 Building summaries on-the-fly as required by end-user


applications.

 Enhancing performance on common queries at coarser


granularities.
 summaries from one (more detailed) level of aggregation and roll
them up into a less detailed summary

 Providing smart tools to assist DBAs in selecting the "best”


aggregations to build i.e. trade-off between speed and space.

44 Datawarehousing & Datamining-Lecture 5


Performance vs. Space Trade-Off
 Maximum performance boost implies using lots of disk space
for storing every pre-calculation.

 Minimum performance boost implies no disk space with zero


pre-calculation.

 Using meta data to determine best level of pre-aggregation


from which all other aggregates can be computed.
 Meta data: data about data, e.g.
 How many stores per zone? How many zones per district? Etc
 It gives a fair idea where the aggregation is going to have the biggest
bang

45 Datawarehousing & Datamining-Lecture 5


Performance vs. Space Trade-off using Wizard
 Microsoft's Usage-Based Optimization Wizard also allows the DBA to tell Analytic
(OLAP) Services to create a new set of aggregations for all queries exceeding a
defined response time threshold.

100 Aggregation answers


most queries
80 

% Gain
60

40 Aggregation answers
few queries

20

2 4 MB 6 8
46 Datawarehousing & Datamining-Lecture 5
Let’s put some steps forward

47 Datawarehousing & Datamining-Lecture 5


Loading the Data Warehouse
Data is periodically
extracted

Data is cleansed and


transformed

Users query the data


warehouse

Source Systems Data Staging Area Data Warehouse


48 Datawarehousing & Datamining-Lecture 5
(OLTP)
Terminology: ETL
 ETL = Extraction, Transformation, & Load
 Extraction: Get the data out of the source systems
 Transformation: Convert the data into a useful
format for analysis
 Load: Get the data into the data warehouse
(…and build indexes, materialized views, etc.)
 We will return to this topic in a couple weeks.

49 Datawarehousing & Datamining-Lecture 5


Data Integration is Hard
 Data warehouses combine data from multiple sources
 Data must be translated into a consistent format
 Data integration represents ~80% of effort for a typical data
warehouse project!
 Some reasons why it’s hard:
 Metadata is often poor or non-existent
 Data quality is often bad
 Missing or default values
 Multiple spellings of the same thing
(Cal vs. UC Berkeley vs. University of California)
 Inconsistent semantics
 What is an airline passenger?

50 Datawarehousing & Datamining-Lecture 5


Federated Databases

 An alternative to data warehouses


 Data warehouse
 Create a copy of all the data
 Execute queries against the copy
 Federated database
 Pull data from source systems as needed to answer queries
 “lazy” vs. “eager” data integration
 Eager  data warehousing
lazy  federated databases Rewritten
Query Queries
Extraction Query

Answer Answer Mediator


Warehouse
Source
Source Systems
51 Data Warehouse Federated
Datawarehousing & Datamining-Lecture 5 Database
Systems
Warehouses vs. Federation
 Advantages of federated databases:
 No redundant copying of data
 Queries see “real-time” view of evolving data
 More flexible security policy
 Disadvantages of federated databases:
 Analysis queries place extra load on transactional systems
 Query optimization is hard to do well
 Historical data may not be available
 Complex “wrappers” needed to mediate between analysis server and
source systems
 Data warehouses are much more common in practice
 Better performance
 Lower complexity
 Slightly out-of-date data is acceptable

52 Datawarehousing & Datamining-Lecture 5


Two Approaches to Data Warehousing
 Data mart: like a data warehouse, but smaller and more
focused
 Top-down approach
 First build single unified data warehouse with all enterprise data
 Then create data marts containing specialized subsets of the data from
the warehouse
 Bottom-up approach
 First build a data mart to solve the most pressing problem
 Then build another data mart, then another
 Data warehouse = union of all data marts
 In practice, not much difference between the two
 Our book advocates the bottom-up approach

53 Datawarehousing & Datamining-Lecture 5


Logical Database Design

 Logical design vs. physical design:


 Logical design = conceptual organization for the database
 Create an abstraction for a real-world process
 Physical design = how is the data stored
 Select data structures (tables, indexes, materialized views)
 Organize data structures on disk
 Three main goals for logical design:
 Simplicity
 Expressiveness
 Performance

54 Datawarehousing & Datamining-Lecture 5


Goals for Logical Design
 Simplicity
 Users should understand the design
 Data model should match users’ conceptual model
 Queries should be easy and intuitive to write
 Expressiveness
 Include enough information to answer all important queries
 Include all relevant data (without irrelevant data)
 Performance
 An efficient physical design should be possible

55 Datawarehousing & Datamining-Lecture 5


Another perspective…
 From a presentation at SIGMOD*:
 Three design rules:
 Simplicity
 Simplicity
 Simplicity

*by Jeff Byard and Donovan Schneider, Red Brick Systems


56 Datawarehousing & Datamining-Lecture 5
Any Question?

57 Datawarehousing & Datamining-Lecture 5

You might also like