Lecture 4: Queries, Query Processing and Optimization: Data Warehouse, Business Intelligence, Data Mining

Data Warehouse, Business Intelligence, Data Mining
Lecture 4: Queries, query processing and optimization
Meike Klettke
Fakultät für Informatik und Elektrotechnik
meike.klettke@uni-rostock.de
Content of the lecture
1. Subtasks of query processing
2. Characteristics of multidimensional queries
3. Relational implementation of queries
   1. Star Joins
   2. Optimization aspects
4. SQL extensions for OLAP queries
   1. CUBE
   2. GROUPING
   3. ROLLUP

Introduction (1/2)
• Typical queries on data warehouses contain aggregations, such as:
  How many articles in the product group "electronic devices" were sold per month in each region in 2017?
• Characteristics of typical data warehouse queries:
  • they cover a large number of tuples
  • they select tuples in several or all dimensions
  • an aggregate function is often executed

Introduction (2/2)
• Multidimensional query:
  • queries include many or even all dimensions
  • specific optimization techniques are useful!
• The data cube contains a lot of data
  • problem: aggregations on large data sets

Query processing: Overview
[Figure: query processing pipeline from SQL query to result, split into compile time and run time. Steps: translation & view resolving, standardization & simplification, logical optimization, physical optimization, cost-based plan choice, plan parameterization, code generation, query execution. Intermediate representations: algebra expression, algebra plan, physical plan, code.]

Query processing phases (1/2)
• Translation & view resolving
  • Transformation into a query plan (relational algebra)
• Standardization & simplification
  • Unnesting of nested queries, simplification of expressions
• Logical / algebraic optimization
  • Optimization independent of the concrete storage and index structures
  • Application of heuristics

Query processing phases (2/2)
• Physical / internal optimization
  • Inclusion of specific access paths and algorithms (e.g. index scans, join implementations)
• Cost-based selection
  • Cost estimation (based on statistics) and selection of the best plan
• Plan parameterization
• Code generation
• Execution

We have already seen multidimensional operations:
• dice
• slice
• rollup
• drill down
• drill across
• rotate
• which can be used for ... (see next slide)

Different kinds of multidimensional queries
[Figure: four panels, each over the axes Product (item) and Time (days): range query, partial-match query, partial range query, point query]

Relational implementation of multidimensional queries
• Basically depends on the mapping to the relational schema
  • Star vs. Snowflake schema
  • Classification hierarchies
• Frequent query pattern
  • Joins between the n dimension tables and the fact table, as well as
  • Restrictions (selections) on dimension tables
  • Grouping
  • Aggregate functions

Star-join: Example
  select G.Branch, Z.Year, sum(Sales) as Total_sales
  from ((Sale V join Geography G on V.Geography_ID = G.Geography_ID)
         join Product P on V.Product_ID = P.Product_ID)
         join Time Z on V.Time_ID = Z.Time_ID
  where P.Product_group = 'Textbook-Computer Science'
  group by G.Branch, Z.Year
(query from lecture 4 (Storage of multidimensional data model))

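The star join above can be tried out on a miniature database. The following is only a sketch using Python's sqlite3 module: the table and column names follow the slide, while the rows (branches, years, sales values) are made-up sample data, and the join is written without the nested parentheses of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Miniature star schema; names follow the slide, all rows are invented sample data.
conn.executescript("""
create table Geography (Geography_ID integer primary key, Branch text);
create table Product   (Product_ID integer primary key, Product_group text);
create table Time      (Time_ID integer primary key, Year integer);
create table Sale      (Geography_ID integer, Product_ID integer,
                        Time_ID integer, Sales integer);

insert into Geography values (1, 'Rostock'), (2, 'Berlin');
insert into Product   values (1, 'Textbook-Computer Science'), (2, 'Novel');
insert into Time      values (1, 2016), (2, 2017);
insert into Sale      values (1, 1, 1, 10), (1, 1, 2, 20), (2, 1, 2, 30),
                             (2, 2, 2, 99), (1, 1, 2, 5);
""")

# Fact table joined with each dimension table, restricted on one dimension,
# grouped by the requested granularity (branch, year).
star_join = """
select G.Branch, Z.Year, sum(V.Sales) as Total_sales
from Sale V
     join Geography G on V.Geography_ID = G.Geography_ID
     join Product   P on V.Product_ID   = P.Product_ID
     join Time      Z on V.Time_ID      = Z.Time_ID
where P.Product_group = 'Textbook-Computer Science'
group by G.Branch, Z.Year
order by G.Branch, Z.Year
"""
result = conn.execute(star_join).fetchall()
for row in result:
    print(row)
```

The sale of the 'Novel' product is filtered out by the restriction on the Product dimension; the remaining rows are summed per branch and year.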
Star-join: Construction
• select clause
  • Key figures, possibly aggregated: SUM, AVG, MAX, MIN, COUNT
  • Granularity of the result, e.g. month, region
• from clause
  • Fact and dimension tables, join conditions
• where clause
  • Restrictions, e.g.:
    Product.Product_category = 'Textbook' and Geography.Country = 'Germany' and Time.Year = 2017
• group by
  • Grouping according to the granularity of the result

A somewhat longer quote
• "A star query is a join between a fact table and a number of dimension tables.
• Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other.
• The cost-based optimizer recognizes star queries and generates efficient execution plans for them."
• Statement from the Oracle website

Optimizing Star Joins
• Star joins are a typical pattern for data warehouse queries
• Typical characteristics of the star schema:
  • very large fact table
  • considerably smaller, independent dimension tables
⇒ Heuristics of classical relational optimizers often fail at this point!
⇒ Cost-based optimizers, which include size estimates, are necessary

Optimizing Star Joins / 2
• Example: join over the fact table Sale and the three dimension tables Product, Time and Geography:
  • 4-way join
  • An RDBMS usually offers only pairwise joins: a sequence of pairwise joins is necessary
  • 4! = 24 possible join orders
• Heuristic to reduce the number of possible execution plans:
  • only joins between relations that are linked by a join condition in the query are considered
  • joins between relations that are not linked by a join condition in the query (cross products) are not considered

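To see how strongly this heuristic prunes the search space, the join orders can simply be enumerated. The sketch below assumes that a join order is a sequence in which one table after another is joined to the intermediate result (a left-deep order); the helper name avoids_cross_product is illustrative only.

```python
from itertools import permutations

# Join graph of the star query: the fact table is linked to every dimension
# table, the dimension tables are not linked to each other.
tables = ["Sale", "Product", "Time", "Geography"]
join_edges = {frozenset(("Sale", dim)) for dim in tables if dim != "Sale"}

def avoids_cross_product(order):
    """True if every table after the first is linked to an already joined table."""
    joined = {order[0]}
    for table in order[1:]:
        if not any(frozenset((table, other)) in join_edges for other in joined):
            return False
        joined.add(table)
    return True

all_orders = list(permutations(tables))
linked_orders = [order for order in all_orders if avoids_cross_product(order)]

print(len(all_orders))     # 24 = 4!
print(len(linked_orders))  # 12
```

Every order that survives has the fact table in first or second position, so an order that combines the dimension tables first, such as (Geography, Time, Product, Sale), is never generated.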
Assumptions:
• Table Sale: 10,000,000 records
• 10 stores in Germany (out of 100)
• 20 selling days in January 2017 (out of 1,000 stored days)
• 50 products in product category "Textbook-Computer Science" (out of 1,000)
• Uniform distribution / same selectivity of the individual values

• The heuristic yields, e.g., the following execution plan:
Plan A:
  ((Sale ⋈ σ Country = 'Germany' (Geography))
         ⋈ σ Month = 'January 2017' (Time))
         ⋈ σ Product_category = 'Textbook-Computer Science' (Product)

• The following execution plan (with a cross product of the dimension tables) is usually not considered:
Plan B:
  Sale ⋈ (σ Product_category = 'Textbook-Computer Science' (Product)
          × (σ Country = 'Germany' (Geography) × σ Month = 'January 2017' (Time)))

Star Join: Example of calculation
• Assumptions as stated above: 10,000,000 Sale records; the selections keep 10 of 100 stores, 20 of 1,000 days and 50 of 1,000 products
Plan A: 1. Join: 1,000,000 tuples (as result)
        2. Join: 20,000 tuples
        3. Join: 1,000 tuples
Plan B: 1. Cross product: 200 tuples (as result)
        2. Cross product: 10,000 tuples
        3. Join: 1,000 tuples
• Both plans return the same 1,000 tuples, but Plan B keeps every intermediate result small because the fact table is touched only in the final join

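The intermediate result sizes can be recomputed from the assumptions. This is a small sketch under the uniform-distribution assumption of the slide; the function names plan_a_sizes and plan_b_sizes are illustrative.

```python
sale_rows = 10_000_000

# (selected values, total values) per dimension, taken from the assumptions
dims = {"Geography": (10, 100), "Time": (20, 1000), "Product": (50, 1000)}

def plan_a_sizes(order):
    """Plan A: join the fact table with one filtered dimension after another."""
    sizes, current = [], sale_rows
    for name in order:
        selected, total = dims[name]
        current = current * selected // total   # uniform selectivity
        sizes.append(current)
    return sizes

def plan_b_sizes(order):
    """Plan B: cross products of the filtered dimensions, then one join with the fact table."""
    sizes = []
    current = dims[order[0]][0]
    for name in order[1:]:
        current *= dims[name][0]                # cross product of filtered dimensions
        sizes.append(current)
    final = sale_rows
    for name in order:
        selected, total = dims[name]
        final = final * selected // total       # final join with the fact table
    sizes.append(final)
    return sizes

order = ["Geography", "Time", "Product"]
print(plan_a_sizes(order))  # [1000000, 20000, 1000]
print(plan_b_sizes(order))  # [200, 10000, 1000]
```

Summed over all steps, Plan A materializes 1,021,000 tuples and Plan B only 11,200.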
Oracle optimizer hints
• select /*+ STAR */ ... from ...
  • Hint to the optimizer that the query is a star query; the use of such hints is not entirely uncontroversial
• Overview of Oracle optimizer hints:
  • http://www.adp-gmbh.ch/ora/sql/hints/index.html
Hint categories
• Hints for Optimization Approaches and Goals
• Hints for Access Paths
• Hints for Query Transformations
• Hints for Join Orders
• Hints for Join Operations
• Hints for Parallel Execution
• Additional Hints

Grouping and aggregation
• Data analysis: aggregation of multidimensional data
• SQL extensions for OLAP:
  group by [ Group-by list
           | rollup ( Group-by list )
           | ( Group-by list ) with rollup
           | cube ( Group-by list )
           | ( Group-by list ) with cube
           | grouping sets ( Group-by sets ) ]
  [ having search condition ]
  • rollup ( Group-by list ) and ( Group-by list ) with rollup are identical
  • cube ( Group-by list ) and ( Group-by list ) with cube are identical

Grouping Sets
• SQL syntax used:
  group by grouping sets ((A, B), (A, C), (C));
• Resulting grouping sets:
  • (A, B)
  • (A, C)
  • (C)
• The aggregate function is therefore calculated only for the specified attribute sets and included in the result

Example
Initial data:
Region       Year  Sale
MV           2017  20
Brandenburg  2017  30
Berlin       2017  50
MV           2016  23
Brandenburg  2016  35
Berlin       2016  40

group by grouping sets (Year)
Region       Year  sum(Sale)
NULL         2017  100
NULL         2016  98

group by grouping sets (Region)
Region       Year  sum(Sale)
MV           NULL  43
Brandenburg  NULL  65
Berlin       NULL  90

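The two results can be reproduced with a few lines of Python. This is only an emulation of the grouping sets semantics, not the way a database evaluates the clause; the function grouping_sets is a hypothetical helper, and NULL is represented by None.

```python
from collections import defaultdict

columns = ("Region", "Year")
rows = [
    ("MV", 2017, 20), ("Brandenburg", 2017, 30), ("Berlin", 2017, 50),
    ("MV", 2016, 23), ("Brandenburg", 2016, 35), ("Berlin", 2016, 40),
]

def grouping_sets(rows, columns, sets):
    """Sum the last field of each row once per grouping set.

    Columns that are not part of the current grouping set are replaced
    by None, which plays the role of NULL in the result.
    """
    result = []
    for grouping_set in sets:
        totals = defaultdict(int)
        for row in rows:
            key = tuple(value if column in grouping_set else None
                        for column, value in zip(columns, row))
            totals[key] += row[-1]
        result.extend(key + (total,) for key, total in totals.items())
    return result

by_year = grouping_sets(rows, columns, [("Year",)])
by_region = grouping_sets(rows, columns, [("Region",)])
print(by_year)    # [(None, 2017, 100), (None, 2016, 98)]
print(by_region)  # [('MV', None, 43), ('Brandenburg', None, 65), ('Berlin', None, 90)]
```

Passing both sets at once, grouping_sets(rows, columns, [("Year",), ("Region",)]), returns the five rows of both result tables together.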
Grouping sets, another example
  select Year, Quarter, sum(Orders) as Orders
  from sales_order
  group by grouping sets ((Year, Quarter), (Year))
  order by Year, Quarter
Result contains aggregate values
• for each combination of year and quarter, as well as
• for each year

Understanding GROUPING SETS / 1
Grouping Sets query:
  select a, b, sum(c)
  from tab1
  group by grouping sets ((a, b))
Query without Grouping Sets:
  select a, b, sum(c)
  from tab1
  group by a, b

Understanding GROUPING SETS / 2
Grouping Sets query:
  select a, b, sum(c)
  from tab1
  group by grouping sets ((a, b), a)
Query without Grouping Sets:
  ( select a, b, sum(c)
    from tab1
    group by a, b )
  union all
  ( select a, null, sum(c)
    from tab1
    group by a )

Understanding GROUPING SETS / 3
Grouping Sets query:
  select a, b, sum(c)
  from tab1
  group by grouping sets (a, b)
Query without Grouping Sets:
  select a, null, sum(c)
  from tab1
  group by a
  union all
  select null, b, sum(c)
  from tab1
  group by b

Understanding GROUPING SETS / 4
Grouping Sets query:
  select a, b, c, sum(d)
  from t
  group by grouping sets (a, b, c)
Query without Grouping Sets:
  ( select a, null, null, sum(d)
    from t
    group by a )
  union all
  ( select null, b, null, sum(d)
    from t
    group by b )
  union all
  ( select null, null, c, sum(d)
    from t
    group by c )

RollUp operator
• RollUp
  • Sum rows (subtotal lines) are inserted into the result set of a query with a group by clause
  • The clause specifies on which dimension attributes the RollUp operation is executed
  • rollup(region, year)
• Also usable to form aggregates along dimension hierarchies
  • Examples:
    • rollup(year, month, day)
    • rollup(country, state, location)

RollUp: Example
  select Year, Quarter, sum(Orders) as Orders
  from sales_order ...
  group by rollup (Year, Quarter)
1) Total number of orders (Year, Quarter)
2) Orders / year (Year)

RollUp: Example / 2
  select Year, Quarter, Region, sum(Orders) as Orders
  from sales_order
  where Region in ('Canada', 'Eastern') and Quarter in ('1', '2')
  group by rollup (Year, Quarter, Region)
  order by Year, Quarter, Region
1) Total number of orders (Year, Quarter, Region)
.. Orders by year and quarter (Year, Quarter)
2) Orders / year (Year)

Understanding Rollup
ROLLUP query:
  select a, b, c, sum(d)
  from t
  group by rollup (a, b, c);
Query without ROLLUP:
  ( select a, b, c, sum(d) from t group by a, b, c )
  union all
  ( select a, b, null, sum(d) from t group by a, b )
  union all
  ( select a, null, null, sum(d) from t group by a )
  union all
  ( select null, null, null, sum(d) from t )

Understanding Rollup / 2
Rollup query:
  select a, b, c, sum(d)
  from t
  group by rollup (a, (b, c))
Query without Rollup:
  ( select a, b, c, sum(d)
    from t
    group by a, b, c )
  union all
  ( select a, null, null, sum(d)
    from t
    group by a )
  union all
  ( select null, null, null, sum(d)
    from t
    group by () )

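The "query without ROLLUP" for rollup (a, b, c) uses only standard grouping and can therefore be run on any SQL engine, including ones without a ROLLUP operator. The sketch below runs it with Python's sqlite3 module on a hypothetical table t with four made-up rows; the parentheses around the individual selects are omitted because SQLite does not accept them in a compound select.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text, c text, d integer)")
conn.executemany("insert into t values (?, ?, ?, ?)", [
    ("a1", "b1", "c1", 1),   # invented sample rows
    ("a1", "b1", "c2", 2),
    ("a1", "b2", "c1", 4),
    ("a2", "b1", "c1", 8),
])

# One select per rollup level: (a, b, c), (a, b), (a), ()
without_rollup = """
select a, b, c, sum(d) from t group by a, b, c
union all
select a, b, null, sum(d) from t group by a, b
union all
select a, null, null, sum(d) from t group by a
union all
select null, null, null, sum(d) from t
"""
rollup_rows = conn.execute(without_rollup).fetchall()
for row in rollup_rows:
    print(row)
```

The result has 4 + 3 + 2 + 1 = 10 rows: the four detail groups, three subtotals per (a, b), two subtotals per a, and one grand total of 15.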
CUBE operator
• "Short form" of query patterns for calculating partial and total sums
• Generates all possible grouping combinations from a given set of grouping attributes
• Result: table with aggregated values
• Total aggregate: null, null, ..., null, f(*)

CUBE operator: Example
Input:
Product          Region  Year  Sales
Data Warehouses  SANH    2017  45
Data Warehouses  THÜR    2017  43
Data Warehouses  SANH    2016  47
Data Warehouses  THÜR    2016  42

Result of CUBE (excerpt):
Product          Region  Year  Sales
Data Warehouses  SANH    2017  45
Data Warehouses  THÜR    2017  43
...              ...     ...   ...
Data Warehouses  SANH    NULL  92
Data Warehouses  THÜR    NULL  85
Data Warehouses  NULL    2017  88
Data Warehouses  NULL    2016  89
Data Warehouses  NULL    NULL  177
NULL             SANH    2017  45
...              ...     ...   ...
NULL             NULL    2017  88
NULL             NULL    2016  89
NULL             NULL    NULL  177

Cube
• SQL syntax used:
  group by cube (A, B, C);
• Resulting grouping sets:
  • (A, B, C)
  • (A, B)
  • (A, C)
  • (B, C)
  • (A)
  • (B)
  • (C)
  • ()

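The grouping sets behind cube and rollup can be enumerated mechanically: cube yields every subset of the attribute list, rollup every prefix. A short sketch with the illustrative helper names cube_sets and rollup_sets:

```python
from itertools import combinations

def cube_sets(attrs):
    """All 2**n subsets of attrs, largest first (the grouping sets of cube)."""
    return [combo
            for size in range(len(attrs), -1, -1)
            for combo in combinations(attrs, size)]

def rollup_sets(attrs):
    """The n + 1 prefixes of attrs, longest first (the grouping sets of rollup)."""
    return [tuple(attrs[:size]) for size in range(len(attrs), -1, -1)]

print(cube_sets(["A", "B", "C"]))
# [('A', 'B', 'C'), ('A', 'B'), ('A', 'C'), ('B', 'C'), ('A',), ('B',), ('C',), ()]
print(rollup_sets(["A", "B", "C"]))
# [('A', 'B', 'C'), ('A', 'B'), ('A',), ()]
```

For n grouping attributes, cube produces 2^n grouping sets (8 for three attributes, 16 for four), whereas rollup produces only n + 1.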
Example for the Cube operator
  select Year, Quarter, sum(Orders) as Orders
  from sales_order
  group by cube (Year, Quarter)
1) Total number of orders (Year, Quarter)
2) Orders per quarter (Quarter)
3) Orders per year (Year)

Understanding CUBE
• The CUBE operation is equivalent to a Grouping Sets query that contains all possible combinations of the grouping attributes.
CUBE query:
  select A, B, C, sum(D)
  from t
  group by cube (A, B, C)
Query without CUBE:
  select A, B, C, sum(D)
  from t
  group by grouping sets
    ( (A, B, C), (A, B), (A), (B, C),
      (B), (A, C), (C), () )

CUBE operator: SQL syntax
• cube: extension of group by
• Implemented in SQL Server, DB2, and Oracle (since 8i)
• Syntax:
  select Product, Region, Year, sum(Sales)
  from Sale
  group by cube(Product, Region, Year)
• Result: contains all values that are calculated by the analogous RollUp statements and, in addition, their combinations
  • (with n dimensions: all 2- to n-attribute combinations)

Iceberg query
  select A, B, C, count(*), sum(X)
  from R
  group by cube(A, B, C)
  having count(*) >= N
• having: condition on grouped values
• Usage:
  • Derivation of "extremes"
  • In the example: returns all groups that reach a minimum support (cf. association rules)

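The effect of the having clause on a cube can be illustrated with a plain Python emulation. This is a sketch only: the relation R(A, B, C, X) is filled with four invented rows, the function iceberg_cube is a hypothetical helper, and NULL is represented by None.

```python
from collections import defaultdict
from itertools import combinations

attrs = ("A", "B", "C")
rows = [  # invented sample rows of R(A, B, C, X)
    ("a1", "b1", "c1", 10),
    ("a1", "b1", "c2", 20),
    ("a1", "b2", "c1", 30),
    ("a2", "b1", "c1", 40),
]

def iceberg_cube(rows, attrs, min_count):
    """Emulate: group by cube(attrs) having count(*) >= min_count."""
    result = []
    for size in range(len(attrs), -1, -1):
        for group in combinations(attrs, size):
            stats = defaultdict(lambda: [0, 0])   # key -> [count(*), sum(X)]
            for row in rows:
                key = tuple(value if attr in group else None
                            for attr, value in zip(attrs, row))
                stats[key][0] += 1
                stats[key][1] += row[-1]
            # having count(*) >= N: keep only groups with enough support
            result.extend(key + (count, total)
                          for key, (count, total) in stats.items()
                          if count >= min_count)
    return result

iceberg = iceberg_cube(rows, attrs, min_count=3)
for row in iceberg:
    print(row)
```

With a minimum support of 3, only the groups a1, b1, c1 and the grand total survive; all finer groups contain at most two rows and are cut off, which is the "tip of the iceberg" effect.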
Summary: Cube, RollUp, Grouping Sets operators
• Cube operator:
  • Generates all combinations: e.g. 16 combinations for 4 grouping attributes, 8 combinations for 3 grouping attributes
  • Example: cube (A, B, C) → (A, B, C) (A, B) (A, C) (B, C) (A) (B) (C) ()
• RollUp operator:
  • Generates only combinations with superaggregates
  • For the example: rollup (A, B, C) → () (A) (A, B) (A, B, C)
• Grouping Sets:
  • Generates aggregate values only for exactly the specified sets
  • For the example: grouping sets ((A, B, C)) → (A, B, C)

Outlook
• There will also be an exercise on the SQL extensions for data warehouses
  • including: building a star schema and using it
  • queries with grouping sets, rollup and cube should be tested there
• In addition: MDX, an alternative for queries on multidimensional structures
