Lecture 4: Queries, Query Processing and Optimization: Data Warehouse, Business Intelligence, Data Mining

Meike Klettke
Fakultät für Informatik und Elektrotechnik
meike.klettke@uni-rostock.de

Content of the lecture

1. Subtasks of query processing
2. Characteristics of multidimensional queries
3. Relational implementation of queries
   3.1 Star Joins
   3.2 Optimization aspects
4. SQL extensions for OLAP queries
   4.1 CUBE
   4.2 GROUPING
   4.3 ROLLUP

Introduction (1/2)
• Typical queries on data warehouses contain aggregations, such as:
  How many articles in the product group "electronic devices" were sold per month in each region in 2017?
• Characteristics of typical data warehouse queries:
  • they cover a large number of tuples
  • they select tuples in several or all dimensions
  • they often apply an aggregate function

Introduction (2/2)

• Multidimensional queries:
  • include many or even all dimensions
  • benefit from specific optimization techniques
• The data cube contains a lot of data
  • problem: aggregations on large data sets

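Query processing: Overview

[Figure: query processing pipeline]
SQL query → Translation & view resolving → Standardization & simplification → Logical optimization → Physical optimization → Cost-based plan choice → Plan parameterization → Code generation → Query execution → Result
Intermediate representations named in the figure: algebra, algebra plan, physical plan, plan, code.
The figure divides the phases into compile time and run time.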

Query processing phases (1/2)

• Translation & view resolving
  • Transformation into a query plan (relational algebra)
• Standardization & simplification
  • Unnesting, simplification of expressions
• Logical / algebraic optimization
  • Optimization without considering concrete storage and index structures
  • Application of heuristics

Query processing phases (2/2)

• Physical / internal optimization
  • Inclusion of specific access paths and algorithms (e.g. index scans, join implementations)
• Cost-based selection
  • Cost estimation (statistics) and selection of the best plan
• Plan parameterization
• Code generation
• Execution

We have already seen multidimensional operations

• dice
• slice
• rollup
• drill down
• drill across
• rotate

• They can be used for the kinds of queries shown on the next slide

Different kinds of multidimensional queries

[Figure: four panels, each plotting Product (item) against Time (days): range query, partial-match request, partial range query, point request]

Relational implementation of multidimensional queries

• Depends fundamentally on the schema mapping
  • Star vs. snowflake schema
  • Classification hierarchies
• Frequent query pattern
  • Joins between the n dimension tables and the fact table
  • Restrictions (selections) on dimension tables
  • Grouping
  • Aggregate functions

Star-join: Example

select G.Branch, Z.Year, sum(Sales) as Total_sales
from ((Sale V join Geography G on V.Geography_ID = G.Geography_ID)
      join Product P on V.Product_ID = P.Product_ID)
     join Time Z on V.Time_ID = Z.Time_ID
where P.Product_group = 'Textbook-Computer Science'
group by G.Branch, Z.Year

(query from lecture 4 (Storage of multidimensional data model))
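
A minimal sketch of this star join, run against an in-memory SQLite database through Python's sqlite3 module. Only the query shape comes from the slide; the column types and all sample rows are invented for illustration, and the join is written without the nested parentheses.

```python
import sqlite3

def star_join_demo():
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table Geography (Geography_ID integer primary key, Branch text);
        create table Product   (Product_ID integer primary key, Product_group text);
        create table Time      (Time_ID integer primary key, Year integer);
        create table Sale      (Geography_ID integer, Product_ID integer,
                                Time_ID integer, Sales integer);
        -- hypothetical sample rows, not taken from the lecture
        insert into Geography values (1, 'Rostock'), (2, 'Berlin');
        insert into Product values (1, 'Textbook-Computer Science'), (2, 'Novel');
        insert into Time values (1, 2016), (2, 2017);
        insert into Sale values (1, 1, 1, 10), (1, 1, 2, 20), (2, 1, 2, 5),
                                (2, 2, 2, 99), (1, 1, 2, 7);
    """)
    # Fact table joined to each dimension; restriction on one dimension;
    # grouping defines the granularity of the result.
    rows = con.execute("""
        select G.Branch, Z.Year, sum(V.Sales) as Total_sales
        from Sale V
             join Geography G on V.Geography_ID = G.Geography_ID
             join Product P   on V.Product_ID = P.Product_ID
             join Time Z      on V.Time_ID = Z.Time_ID
        where P.Product_group = 'Textbook-Computer Science'
        group by G.Branch, Z.Year
        order by G.Branch, Z.Year
    """).fetchall()
    con.close()
    return rows

print(star_join_demo())
```

The 'Novel' sale is filtered out by the restriction on the Product dimension, so it does not contribute to any total.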

Star-join: Construction
• select clause
  • Key figures, possibly aggregated: SUM, AVG, MAX, MIN, COUNT
  • Granularity of the result, e.g. month, region
• from clause
  • Fact and dimension tables, join conditions
• where clause
  • Restrictions, e.g.:
    Product.Product_category = 'Textbook' and Geography.Country = 'Germany' and Time.Year = 2017
• group by clause
  • Grouping according to the granularity of the result

A somewhat longer quote

• "A star query is a join between a fact table and a number of dimension tables.
• Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other.
• The cost-based optimizer recognizes star queries and generates efficient execution plans for them."

• Statement from the Oracle websites

Optimizing Star Joins

• Star joins are a typical pattern for data warehouse queries
• Typical characteristics of the star schema:
  • very large fact table
  • considerably smaller, independent dimension tables
⇒ Heuristics of classical relational optimizers often fail at this point!
⇒ Cost-based optimizers, which include size estimates, are necessary

Optimizing Star Joins / 2
• Example: join over the fact table Sale and the three dimension tables Product, Time and Geography:
  • 4-way join
  • An RDBMS usually implements only pairwise joins, so a sequence of pairwise joins is necessary
  • 4! = 24 possible join orders
• Heuristic to reduce the number of possible execution plans:
  • only joins between relations that are linked by a join condition in the query are considered
  • joins between relations that are not linked by a join condition in the query (cross products) are not considered
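
The effect of this heuristic can be sketched by enumerating the join orders. The sketch assumes left-deep plans only (each order is a sequence in which one table is added per step) and uses the join graph of the example query; the helper names are illustrative.

```python
from itertools import permutations

# Join graph of the star query: the fact table is linked to every dimension,
# the dimension tables are not linked to each other.
JOIN_EDGES = {("Sale", "Product"), ("Sale", "Time"), ("Sale", "Geography")}
TABLES = ["Sale", "Product", "Time", "Geography"]

def linked(a, b):
    """True if the query contains a join condition between tables a and b."""
    return (a, b) in JOIN_EDGES or (b, a) in JOIN_EDGES

def avoids_cross_product(order):
    """True if every table after the first is linked to a table already joined."""
    return all(
        any(linked(order[i], earlier) for earlier in order[:i])
        for i in range(1, len(order))
    )

all_orders = list(permutations(TABLES))
pruned = [o for o in all_orders if avoids_cross_product(o)]
print(len(all_orders), len(pruned))  # 24 12
```

Every surviving order has the fact table in first or second position, so the large fact table always enters the plan early.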

Optimizing Star Joins / 2

Assumptions:
• Table Sales: 10,000,000 records
• 10 stores in Germany (out of 100)
• 20 selling days in January 2017 (out of 1000 stored days)
• 50 products in the product category "Textbook-Computer Science" (out of 1000)
• Uniform distribution, i.e. the same selectivity for each individual value

Optimizing Star Joins / 3
• A heuristic yields e.g. the following execution plan:

Plan A (join tree, innermost join first):
((Sale ⋈ σ Country = 'Germany' (Geography))
       ⋈ σ Month = 'January 2017' (Time))
       ⋈ σ Product_category = 'Textbook-Computer Science' (Product)

Optimizing Star Joins / 4
• The following execution plan is usually not considered (it uses the cross product of the dimension tables):

Plan B (join tree, innermost operation first):
Sale ⋈ ((σ Country = 'Germany' (Geography) × σ Month = 'January 2017' (Time))
        × σ Product_category = 'Textbook-Computer Science' (Product))

Star Join: Example of calculation
• Assumptions:
  • Table Sales: 10,000,000 records
  • 10 stores in Germany (out of 100)
  • 20 selling days in January 2017 (out of 1000 stored days)
  • 50 products in the product category "Textbook-Computer Science" (out of 1000)
  • Uniform distribution, i.e. the same selectivity for each individual value
Plan A: 1. Join: 1,000,000 tuples (as result)
        2. Join: 20,000 tuples
        3. Join: 1,000 tuples
Plan B: 1. Cross product: 200 tuples (as result)
        2. Cross product: 10,000 tuples
        Join: 1,000 tuples
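
The figures for both plans follow directly from the assumptions. The short sketch below recomputes them; it assumes that Plan A joins Geography, then Time, then Product, which is the order that reproduces the numbers on the slide.

```python
FACT_ROWS = 10_000_000
# (selected values, total values) per dimension, taken from the assumptions
GEOGRAPHY = (10, 100)
TIME = (20, 1000)
PRODUCT = (50, 1000)

def plan_a():
    """Join the fact table with one filtered dimension at a time."""
    sizes = []
    rows = FACT_ROWS
    for selected, total in (GEOGRAPHY, TIME, PRODUCT):
        rows = rows * selected // total  # uniform distribution assumed
        sizes.append(rows)
    return sizes

def plan_b():
    """Cross product of the filtered dimensions first, fact table last."""
    cross1 = GEOGRAPHY[0] * TIME[0]
    cross2 = cross1 * PRODUCT[0]
    selected = GEOGRAPHY[0] * TIME[0] * PRODUCT[0]
    total = GEOGRAPHY[1] * TIME[1] * PRODUCT[1]
    return [cross1, cross2, FACT_ROWS * selected // total]

print(plan_a())  # [1000000, 20000, 1000]
print(plan_b())  # [200, 10000, 1000]
```

Both plans deliver the same 1,000 result tuples, but the intermediate results add up to 1,021,000 tuples for Plan A and only 11,200 for Plan B.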

Oracle optimizer hints
• select /*+ star */ ... from ...
  • Tells the optimizer to treat the query as a star query; the use of such hints is not uncontroversial

• Overview of Oracle optimizer hints:
  • http://www.adp-gmbh.ch/ora/sql/hints/index.html

Hint categories
• Hints for Optimization Approaches and Goals
• Hints for Access Paths
• Hints for Query Transformations
• Hints for Join Orders
• Hints for Join Operations
• Hints for Parallel Execution
• Additional Hints

Grouping and aggregation
• Data analysis: aggregation of multidimensional data
• SQL extensions for OLAP:

  group by [ group-by list
           | rollup ( group-by list )
           | ( group-by list ) with rollup      (identical to the rollup form)
           | cube ( group-by list )
           | ( group-by list ) with cube        (identical to the cube form)
           | grouping sets ( group-by sets ) ]
  [ having search condition ]

Grouping Sets
• SQL syntax:
  group by grouping sets ((A, B), (A, C), (C));
• Resulting grouping sets:
  • (A, B)
  • (A, C)
  • (C)
• The aggregate function is therefore calculated only for the specified attribute sets and included in the result

Example
Initial data:
Region       Year  Sale
MV           2017  20
Brandenburg  2017  30
Berlin       2017  50
MV           2016  23
Brandenburg  2016  35
Berlin       2016  40

group by grouping sets (Year)

Region  Year  sum(Sale)
NULL    2017  100
NULL    2016  98

Example
Initial data:
Region       Year  Sale
MV           2017  20
Brandenburg  2017  30
Berlin       2017  50
MV           2016  23
Brandenburg  2016  35
Berlin       2016  40

group by grouping sets (Region)

Region       Year  sum(Sale)
MV           NULL  43
Brandenburg  NULL  65
Berlin       NULL  90
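
Both example results can be reproduced with a small script. SQLite does not offer GROUPING SETS, so this sketch emulates `grouping sets ((Year), (Region))` with one grouped query per set, combined by union all; the table name `sales` is an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table sales (Region text, Year integer, Sale integer)")
con.executemany("insert into sales values (?, ?, ?)", [
    ("MV", 2017, 20), ("Brandenburg", 2017, 30), ("Berlin", 2017, 50),
    ("MV", 2016, 23), ("Brandenburg", 2016, 35), ("Berlin", 2016, 40),
])

# One select per grouping set; the column that is not grouped is filled with NULL.
rows = con.execute("""
    select null as Region, Year, sum(Sale) from sales group by Year
    union all
    select Region, null, sum(Sale) from sales group by Region
""").fetchall()

by_year = {year: total for region, year, total in rows if region is None}
by_region = {region: total for region, year, total in rows if year is None}
print(by_year)
print(by_region)
```

The NULL in a result row marks the attribute that was aggregated away, exactly as in the two result tables.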

Grouping sets, another example
select Year, Quarter, sum(Orders) as Orders
from sales_order
group by grouping sets ((Year, Quarter), (Year))
order by Year, Quarter

The result contains aggregate values
• for each combination of year and quarter, as well as
• for each year

Understanding GROUPING SETS / 1

Grouping Sets query:
select a, b, sum(c)
from tab1
group by grouping sets ((a, b))

Query without Grouping Sets:
select a, b, sum(c)
from tab1
group by a, b

Understanding GROUPING SETS / 2

Grouping Sets query:
select a, b, sum(c)
from tab1
group by grouping sets ((a, b), a)

Query without Grouping Sets:
( select a, b, sum(c)
  from tab1
  group by a, b )
union
( select a, null, sum(c)
  from tab1
  group by a )

Understanding GROUPING SETS / 3

Grouping Sets query:
select a, b, sum(c)
from tab1
group by grouping sets (a, b)

Query without Grouping Sets:
select a, null, sum(c)
from tab1
group by a
union
select null, b, sum(c)
from tab1
group by b

Understanding GROUPING SETS / 4

Grouping Sets query:
select a, b, c, sum(d)
from t
group by grouping sets (a, b, c)

Query without Grouping Sets:
( select a, null, null, sum(d)
  from t
  group by a )
union
( select null, b, null, sum(d)
  from t
  group by b )
union
( select null, null, c, sum(d)
  from t
  group by c )

RollUp operator
• RollUp
  • Total rows (subtotal rows) are inserted into the result set of a query with a group by clause
  • The clause specifies on which dimension attributes the RollUp operation is executed
    • rollup(region, year)
  • Can also be used to form aggregates along dimensions
• Examples:
  • rollup(year, month, day)
  • rollup(country, state, location)

RollUp: Example
select Year, Quarter, sum(Orders) as Orders
from sales_order ...
group by rollup (Year, Quarter)
order by Year, Quarter

1) Total number of orders (Year, Quarter)
2) Orders / year (Year)

RollUp: Example / 2

select Year, Quarter, Region, sum(Orders) as Orders
from sales_order
where Region in ('Canada', 'Eastern') and Quarter in ('1', '2')
group by rollup (Year, Quarter, Region)
order by Year, Quarter, Region

1) Total number of orders (Year, Quarter, Region)
.. Orders by year and quarter (Year, Quarter)
2) Orders / year (Year)

Understanding Rollup

ROLLUP query:
select a, b, c, sum(d)
from t
group by rollup (a, b, c);

Query without ROLLUP:
( select a, b, c, sum(d) from t group by a, b, c )
union
( select a, b, null, sum(d) from t group by a, b )
union
( select a, null, null, sum(d) from t group by a )
union
( select null, null, null, sum(d) from t )

Understanding Rollup / 2

Rollup query:
select a, b, c, sum(d)
from t
group by rollup (a, (b, c))

Query without Rollup:
( select a, b, c, sum(d)
  from t
  group by a, b, c )
union all
( select a, null, null, sum(d)
  from t
  group by a )
union all
( select null, null, null, sum(d)
  from t
  group by () )
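
The pattern behind both rollup expansions is that the grouping sets are the prefixes of the rollup list, from the full list down to the empty set. A minimal sketch, where the function name and the tuple convention for composite items such as (b, c) are assumptions made for illustration:

```python
def rollup_grouping_sets(*items):
    """Expand rollup(items) into its grouping sets, longest prefix first.

    Each item is a column name or a tuple of column names; a tuple such as
    ("b", "c") is treated as one unit, as in rollup (a, (b, c)).
    """
    sets = []
    for n in range(len(items), -1, -1):
        columns = []
        for item in items[:n]:
            columns.extend(item if isinstance(item, tuple) else (item,))
        sets.append(tuple(columns))
    return sets

print(rollup_grouping_sets("a", "b", "c"))
# [('a', 'b', 'c'), ('a', 'b'), ('a',), ()]
print(rollup_grouping_sets("a", ("b", "c")))
# [('a', 'b', 'c'), ('a',), ()]
```

A rollup over n items therefore yields n + 1 grouping sets, one per select in the equivalent union query.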

CUBE operator

• "Short form" of query patterns for calculating partial and total sums
• Generates all possible grouping combinations from a given set of grouping attributes
• Result: table with aggregated values
• Total aggregate:
  null, null, ..., null, f(*)

CUBE operator: Example

Input:
Product          Region  Year  Sales
Data Warehouses  SANH    2017  45
Data Warehouses  THÜR    2017  43
Data Warehouses  SANH    2016  47
Data Warehouses  THÜR    2016  42

Result of CUBE (rows containing NULL are sums):
Product          Region  Year  Sales
Data Warehouses  SANH    2017  45
Data Warehouses  THÜR    2017  43
...              ...     ...   ...
Data Warehouses  SANH    NULL  92
Data Warehouses  THÜR    NULL  85
Data Warehouses  NULL    2017  88
Data Warehouses  NULL    2016  89
Data Warehouses  NULL    NULL  177
NULL             SANH    2017  45
...              ...     ...   ...
NULL             NULL    2017  88
NULL             NULL    2016  89
NULL             NULL    NULL  177
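
The sums in this example can be recomputed from the four input rows. The following sketch is a naive in-memory cube, not how a database evaluates CUBE: every input row is added to each of the 2^3 cells obtained by keeping a dimension value or replacing it with None (standing in for NULL).

```python
from itertools import product as cartesian

ROWS = [
    ("Data Warehouses", "SANH", 2017, 45),
    ("Data Warehouses", "THÜR", 2017, 43),
    ("Data Warehouses", "SANH", 2016, 47),
    ("Data Warehouses", "THÜR", 2016, 42),
]

def cube_sum(rows):
    """Sum the last column for every combination of kept and rolled-up dimensions."""
    n_dims = len(rows[0]) - 1
    totals = {}
    for *dims, value in rows:
        # each dimension is either kept (True) or aggregated away (False)
        for keep in cartesian((True, False), repeat=n_dims):
            cell = tuple(d if k else None for d, k in zip(dims, keep))
            totals[cell] = totals.get(cell, 0) + value
    return totals

cube = cube_sum(ROWS)
print(cube[("Data Warehouses", None, 2016)])  # 47 + 42 = 89
print(cube[(None, None, None)])               # 177
```

For this input the full cube has 18 rows: 2 possible entries for Product (the value or NULL) times 3 for Region times 3 for Year.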

Cube
• SQL syntax:
  group by cube (A, B, C);
• Resulting grouping sets:
  • (A, B, C)
  • (A, B)
  • (A, C)
  • (B, C)
  • (A)
  • (B)
  • (C)
  • ()

Example for the CUBE operator
select Year, Quarter, sum(Orders) as Orders
from sales_order
group by cube (Year, Quarter)
order by Year, Quarter

1) Total number of orders (Year, Quarter)
2) Orders per quarter (Quarter)
3) Orders per year (Year)

Understanding CUBE

• The CUBE operation is equivalent to a Grouping Sets query that contains all possible combinations of the grouping attributes.

CUBE query:
select A, B, C, sum(D)
from t
group by cube (A, B, C)

Query without CUBE:
select A, B, C, sum(D)
from t
group by grouping sets
  ( (A, B, C), (A, B), (A), (B, C),
    (B), (A, C), (C), () )
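
The list of grouping sets that a cube stands for is the set of all subsets of the grouping attributes. A short sketch that generates it (the function name is illustrative):

```python
from itertools import combinations

def cube_grouping_sets(*columns):
    """All subsets of the columns, in column order, largest subsets first."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]

sets = cube_grouping_sets("A", "B", "C")
print(sets)
print(len(sets), len(cube_grouping_sets("A", "B", "C", "D")))  # 8 16
```

With n grouping attributes this gives 2^n grouping sets, which is why a cube over many attributes becomes expensive quickly.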

CUBE operator: SQL syntax
• cube: extension of group by
• Implemented in SQL Server, DB2, and Oracle (since 8i)
• Syntax:

select Product_group, Region, Year, sum(Sales)
from Sale
group by cube(Product_group, Region, Year)

• Result: contains all values that the analogous RollUp statements calculate, and additionally their combinations
  • (with n dimensions: the combinations of every size up to n, 2^n grouping sets in total)

Iceberg query

select A, B, C, count(*), sum(X)
from R
group by cube(A, B, C)
having count(*) >= N

• having: conditions on grouped values
• Usage:
  • Derivation of "extremes"
  • In the example: returns all groups that reach a minimum support of N (cf. association rules)
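
The iceberg condition can be sketched on top of a naive in-memory cube. The relation R below is a hypothetical four-row sample and N = 3 is an arbitrary threshold; a real system would evaluate the having clause inside the database.

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical sample relation R(A, B, C, X)
R = [
    ("a1", "b1", "c1", 10),
    ("a1", "b1", "c2", 20),
    ("a1", "b2", "c1", 30),
    ("a2", "b1", "c1", 40),
]

def iceberg_cube(rows, min_count):
    """Return {cell: (count, sum_x)} for all cube cells with count >= min_count."""
    stats = defaultdict(lambda: [0, 0])
    for *dims, x in rows:
        # add the row to every cell of the cube it belongs to
        for keep in cartesian((True, False), repeat=len(dims)):
            cell = tuple(d if k else None for d, k in zip(dims, keep))
            stats[cell][0] += 1
            stats[cell][1] += x
    # corresponds to: having count(*) >= N
    return {cell: tuple(s) for cell, s in stats.items() if s[0] >= min_count}

frequent = iceberg_cube(R, 3)
print(frequent)
```

Only the "tip of the iceberg" remains: of all the cells in the cube, the four with at least three supporting rows.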

Summary: Cube, RollUp and Grouping Sets operators
• Cube operator:
  • Generates all combinations: e.g. 16 combinations for 4 grouping attributes, 8 combinations for 3 grouping attributes
  • Example: (A, B, C) → (A, B, C) (A, B) (A, C) (B, C) (A) (B) (C) ()
• RollUp operator:
  • Generates only combinations with superaggregates
  • For the example: (A, B, C) → () (A) (A, B) (A, B, C)
• Grouping Sets:
  • Generates aggregate values only for exactly the specified set
  • For the example: (A, B, C) → (A, B, C)

Outlook
• There will also be an exercise on the SQL extensions for data warehouses
  • including building and using a star schema
  • queries with grouping sets, rollup and cube should be tried out there

• In addition: MDX as an alternative for queries on multidimensional structures
