

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Cloud-based OLAP over Big Data: Application Scenarios and Performance Analysis

Alfredo Cuzzocrea*, Rim Moussa†, Guandong Xu‡ and Giorgio Mario Grasso§
* ICAR-CNR and University of Calabria, Cosenza, Italy. Email: cuzzocrea@si.dimes.unical.it
† ZENITH, INRIA, Sophia Antipolis, France. Email: rim.moussa@inria.fr
‡ University of Technology Sydney, Australia. Email: guandong.xu@uts.edu.au
§ CSECS Department, University of Messina, Messina, Italy. Email: gmgrasso@unime.it

Abstract—Following our previous research results, in this paper we provide two authoritative application scenarios that build on top of OLAP*, a middleware for parallel processing of OLAP queries that truly realizes effective and efficient OLAP over Big Data. We provide two authoritative case studies, namely parallel OLAP data cube processing and virtual OLAP data cube design, for which we also propose a comprehensive performance evaluation and analysis. The derived analysis clearly confirms the benefits of our proposed framework.

I. INTRODUCTION

In a previous research experience [1], we investigated solutions relying on data partitioning schemes for the parallel building of OLAP data cubes, and we described the framework OLAP*, suitable for novel Big Data environments [2], [3], [4], with the specific goal of supporting OLAP over Big Data (e.g., [5], [6]), along with the associated benchmark TPC-H*d, an appropriate transformation of the well-known data warehouse benchmark TPC-H [7], which we used to stress OLAP*. We also demonstrated the effectiveness of the proposed framework through several running examples that follow the principles of our framework and were developed on top of the well-known ROLAP server Mondrian [8]. In order to accomplish the different requirements posed by Big Data processing (e.g., [9]), OLAP* realizes a middleware for parallel processing of OLAP queries. Summarizing, the OLAP* architecture is devoted to query routing and cube post-processing over Big Data, and it encompasses emerging components such as Mondrian (a ROLAP server) and MySQL (a relational DBMS). It can easily be interfaced with classical OLAP clients at the front-end side, such as JPivot, thanks to suitable APIs. As mentioned, OLAP* comes with an innovative benchmark specially targeted at multidimensional data, called the TPC-H*d benchmark, which is inspired by the well-known TPC-H benchmark [7], the most prominent decision-support benchmark. Basically, TPC-H*d is a suitable transformation of the TPC-H benchmark into a multidimensional OLAP benchmark. Indeed, each business query of the TPC-H workload is mapped onto an OLAP cube, and a temporal dimension (Time table) is added to the data warehouse. Also, we translate the TPC-H SQL workload into a corresponding (TPC-H*d) MDX workload. This approach results in specific fragmentation schemas over the respective TPC-H tables [1].

With the goal of complementing these previous results [1], in this paper we extend our research efforts by providing two authoritative application scenarios that build on top of OLAP*, namely parallel OLAP data cube processing (Section II) and virtual OLAP data cube design (Section III), for which we provide a detailed analysis along with a performance evaluation. We conclude the paper in Section IV by summarizing our research contributions and laying the basis for future work.

II. APPLICATION SCENARIO: PARALLEL OLAP DATA CUBE PROCESSING

When dealing with huge data sets, most OLAP systems require high computing capacities and are I/O-bound and CPU-bound. This is due to hard drive I/O performance, which does not evolve as fast as storage and computing hardware (Moore's Law) or network hardware (Gilder's Law). Since the 1980s, with RAID systems, both practitioners and experts have acknowledged that the more disk I/O is divided across disk drives, the better storage systems perform. In order to achieve high performance and large capacity, database systems and distributed file systems rely upon data partitioning, parallel I/O and parallel processing. Besides high capacity and complex query performance requirements, these applications require scalability of both data and workload. It is well established that the Shared-Nothing architecture [10], which features independent processors interconnected via high-speed networks, is the best suited to these scalability requirements. Following these considerations, in this Section we investigate solutions relying on data partitioning schemes for the parallel building of OLAP data cubes, and we propose the framework OLAP*, suitable for novel Big Data environments [2], [3], [4], along with the associated benchmark TPC-H*d, an appropriate transformation of the well-known data warehouse benchmark TPC-H [7]. We demonstrate through performance measurements the efficiency of the proposed framework, developed on top of the ROLAP server Mondrian [8].

A. Physical Design Techniques of Distributed Data Warehouses

Data partitioning aims at minimizing (i) the cost of execution of the OLAP workload, by enabling intra-query parallelism and inter-query parallelism (we recall that inter-query parallelism consists in simultaneously processing different queries on distinct nodes, while intra-query parallelism is obtained when multiple nodes process the same query at the same time); (ii) the cost of maintenance of the data warehouse, through targeted and parallel refresh operations; and (iii) the cost of ownership of the data warehouse, through the use of commodity hardware with a shared-nothing architecture rather than expensive server architectures. Hereafter, we overview four types of partitioning schemes.

The first type of partitioning scheme is based on a fully replicated data warehouse, where the data warehouse is replicated on a database cluster. Load balancing of the workload among nodes enables inter-query parallelism, and consequently decreases the delay required to process a query. This scheme does not, however, reduce the time to execute a single query. In order to enable intra-query parallelism, Akal et al. propose Simple Virtual Partitioning (SVP) [11]. SVP consists in fully replicating a database over a set of nodes, and breaking each query into sub-queries by adding predicates. Each node receives a sub-query and consequently processes a different subset of data items; each such subset is called a virtual partition. Notice that SVP assumes that the DBMS embeds an efficient query optimizer; otherwise, a sub-query execution time will be comparable to the full query execution time. Besides the DBMS robustness, SVP efficiency relies on a partitioning attribute producing equal-size virtual partitions. For fully replicated data warehouses, Lima et al. propose Adaptive Virtual Partitioning (AVP) [12], where the query optimizer is responsible for setting up the best virtual partition size.
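To make SVP concrete, the following is a minimal SQL sketch (our illustration under assumed TPC-H key ranges, not the actual code of [11]): a revenue aggregate over a fully replicated LINEITEM table is broken into range-predicated sub-queries, one per node, whose partial results are then summed by the coordinator.

-- Original query (runs on a single node):
--   SELECT SUM(l_extendedprice * (1 - l_discount)) FROM LINEITEM;
--
-- SVP sketch over 4 nodes, using l_orderkey as the partitioning
-- attribute and assuming order keys span [1, 6000000]:

-- Sub-query shipped to node 1 (its virtual partition):
SELECT SUM(l_extendedprice * (1 - l_discount)) AS partial_revenue
FROM LINEITEM
WHERE l_orderkey >= 1 AND l_orderkey < 1500001;

-- Nodes 2..4 receive the same query with the ranges
-- [1500001, 3000001), [3000001, 4500001) and [4500001, 6000001);
-- the coordinator adds the four partial_revenue values to obtain
-- the global result.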
The second type of partitioning scheme performs primary horizontal partitioning of the fact table and replicates all dimensions [13].

The third type of partitioning scheme proposes derived horizontal partitioning (DHP) of the fact table along selected dimension tables. Consequently, a new issue emerges, namely the choice of the dimension tables. The problem is known as the referential horizontal partitioning problem, and it has been proved NP-hard [14]. Hereafter, we present the heuristics used for the selection of dimension tables.

Referential Partition: A simple solution for selecting a dimension table along which to referentially partition the fact table is to choose the biggest dimension table. This selection is based on the following motivations. First, joins are expensive operations, particularly when the relations involved are substantially larger than main memory; consequently, DHP of the fact table along the biggest table will reduce the cost of joining these two relations. Second, the fragmentation of the biggest dimension table will save storage costs.

Frequency-based Analysis: The second solution is to choose the most frequently used dimension table(s). The frequency of a dimension table is calculated by counting its appearances in the workload. This selection aims at reducing the time to join the most frequent dimension table to the fact table.

Minimal Sharing: The third solution is to choose the dimension table having the minimal share with the fact table [14]. An OLAP workload is composed of star-join queries, which feature multiple and sequential joins to the fact table (resulting in right-deep or left-deep execution trees) and imply finding a cost-effective order for joining dimension tables to the fact table. As in the join ordering problem [15], the best choice of dimension table should reduce the size of the intermediate results of the most important star-join queries. Based on this analysis, the solution consists in referentially partitioning the fact table along the dimension table that reduces the size of intermediate results.

The fourth type of data warehouse partitioning is based on range and list partitioning of the fact table along attributes belonging to hierarchy levels of the dimension tables. Stohr et al. [16] propose a multidimensional hierarchical fragmentation called MDHF for partitioning the fact table. MDHF allows choosing multiple fragmentation attributes from different hierarchy levels of the dimension tables, with each fragmentation attribute referring to a different dimension.

B. OLAP* Framework Key Considerations for Data Fragmentation

For the design of the data fragmentation schema, we propose the following key considerations.

Reduce the Size of Each Cube to be Built at Each Node: The size of an OLAP cube denotes the cardinality of its dimensional data. The building of some cubes is very complex; indeed, some dimensions have high cardinality (see examples of high-cardinality dimensions within the TPC-H*d benchmark in Table I). The data fragmentation scheme should allow the parallel building of small cubes through the partitioning of big-cardinality dimensions. Indeed, data partitioning will reduce the number of levels to cross in aggregations, as well as the memory requirements. The reduction of the physical sizes of OLAP cubes reduces the complexity of OLAP cube building and improves the build time.

Simplify Post-Processing of the Parallel OLAP Workload: Every business query subject to intra-query parallelism is divided into sub-queries. In order to achieve a performance gain, sub-queries run in parallel. Nevertheless, post-processing should be as simple as possible. Indeed, three possible post-processing strategies exist, depending on the fragmentation scheme: (i) if the local OLAP cubes have the same dimension members, then the resulting cube has the same dimension members as any of the locally built cubes, and post-processing consists in performing sums of sums, sums of counts, and so on over locally computed measures; (ii) if the local OLAP cubes have disjoint dimension members, then post-processing consists in performing the union of all locally built cubes; finally, (iii) if the local OLAP cubes share some dimension members, the resulting OLAP cube is obtained by merging all dimension hierarchies and performing sums of sums and sums of counts for shared members over the computed measures. Notice that the first and second strategies are both simple, but the first is memory-consuming, whereas the second allows the best parallel computing and is the least memory-consuming. The third strategy is the most complicated. In conclusion, the proposed data fragmentation scheme should enable parallel OLAP cube building and implement simple post-processing for federating the results obtained from the computing nodes.
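The first two strategies can be sketched in SQL as follows (a minimal sketch, assuming each node exports its local cube as a relational table cube_node_i(nation, year, sum_rev, cnt); these names are ours, not part of OLAP*).

-- Strategy (i): local cubes share the SAME dimension members
-- (e.g., the fact table is partitioned): merge cell by cell with
-- a sum of sums and a sum of counts.
SELECT nation, year,
       SUM(sum_rev) AS sum_rev,  -- sum of sums
       SUM(cnt)     AS cnt       -- sum of counts
FROM (SELECT * FROM cube_node_1
      UNION ALL
      SELECT * FROM cube_node_2) AS partial_cubes
GROUP BY nation, year;

-- Strategy (ii): local cubes have DISJOINT dimension members
-- (e.g., a partitioned dimension): the global cube is the plain
-- union of the local cubes, with no re-aggregation needed.
SELECT * FROM cube_node_1
UNION ALL
SELECT * FROM cube_node_2;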

Enhance Data Warehouse Maintenance: Data warehouse maintenance addresses how changes to the sources are propagated to the warehouse, including aggregate tables and OLAP cube refreshes. The maintenance of a cube should be handled as a bulk incremental update operation; record-at-a-time updates or full re-computation are not viable solutions.
Reduce Storage Overhead: For achieving high performance, and particularly for avoiding inter-site joins, data fragmentation is usually combined with data replication. Replication is also useful for load balancing (i.e., queries are processed by replicas) and for high availability (i.e., data remains available despite the crash of some nodes).

C. TPC-H*d Fragmentation Schema
Hereafter, we describe the TPC-H*d schema and workload characteristics that the OLAP* framework has to deal with.

Conflicting Business Query Recommendations: In Table IV, we enumerate, for each big table of the TPC-H*d relational schema, the list of partitioning alternatives as well as the business queries recommending each partitioning schema. Notice that we distinguish two types of conflicting recommendations: (i) conflicting recommendations issued by different business queries, such as business queries Q4 and Q19. Q4 (Order Priority Checking) counts the number of orders ordered in a given quarter of a given year in which at least one lineitem was received by the customer later than its committed date; Q4 recommends the DHP of LINEITEM along the ORDERS table. Q19 (Discounted Revenue) finds the gross discounted revenue for all orders for different types of parts that were shipped by air and delivered in person, where parts are selected based on a combination of specific brands, a list of containers, and a range of sizes; Q19 recommends the DHP of LINEITEM along PART. (ii) Conflicting recommendations issued by the same business query. For instance, business query Q16 (Parts/Supplier Relationship) counts the number of suppliers who can supply parts that satisfy a particular customer's requirements; the customer is interested in parts of eight different sizes, as long as they are not of a given type, not of a given brand, and not from a supplier who has had complaints registered at the Better Business Bureau. Q16 recommends the DHP of PARTSUPP along the SUPPLIER table or along the PART table.

High-Cardinality Dimensions: When designing a data warehouse, the number of records in the dimension tables greatly influences the overall system performance. Some dimensions are huge, with millions of members; we call those High-Cardinality Dimensions (HCD). Table I lists the HCD within the TPC-H*d benchmark, and the OLAP cubes in which they show up. The TPC-H*d benchmark also features enumerated attributes with a high number of values: within PART, we cite p_type (150 distinct values), p_size (50 distinct values) and p_container (40 distinct values). These attributes are invoked in the dimension hierarchies of multiple OLAP cubes, such as C8, C16, C17 and C19.

TABLE II. TPC-H*d FRAGMENTATION SCHEMA.
Relation | Schema
Customer | PHPed along c_custkey
Orders | DHPed along o_custkey
LineItem | DHPed along l_orderkey
PartSupp, Supplier, Part, Region, Nation, Time | Replicated

Table II shows the proposed fragmentation schema of the TPC-H*d relational data warehouse. We recall that PHP stands for Primary Horizontal Partitioning and DHP stands for Derived Horizontal Partitioning (e.g., [16], [17], [18]), both of which are well-known data warehouse partitioning strategies.
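As an illustration of how rows could be routed under this schema (a hypothetical 4-node loader sketch in SQL, not the paper's actual implementation), co-location is obtained by deriving every row's target node from its CUSTOMER ancestor:

-- CUSTOMER is PHPed on c_custkey:  node = MOD(c_custkey, 4).
-- ORDERS is DHPed along CUSTOMER:  an order goes to the node of its
--   customer, i.e. node = MOD(o_custkey, 4).
-- LINEITEM is DHPed along ORDERS:  a lineitem goes to the node of its
--   order, which the loader can look up through the parent table:
SELECT l.*, MOD(o.o_custkey, 4) AS target_node
FROM LINEITEM AS l
JOIN ORDERS   AS o ON o.o_orderkey = l.l_orderkey;

-- PARTSUPP, SUPPLIER, PART, REGION, NATION and TIME are simply
-- copied to every node (replication), so star joins stay local.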
With respect to the partitioning scheme shown in Table II, the typical business queries of TPC-H [7] can in turn be partitioned into three different types of queries, with three corresponding executions to handle. These query classes are the following:

• Class 1: as a result of replication, business queries which involve only replicated tables are executed by any single node.

• Class 2: as a result of partitioning, business queries which involve only the partitioned tables are executed on one database node.

• Class 3: business queries which involve both partitioned and replicated tables are executed on all database nodes. In this class, we distinguish two types of cube post-processing, namely:
  ◦ Sub-Class 3.1: the cubes built at the edge servers have completely different dimension members; consequently, the result cube is obtained by operating the UNION ALL of the cubes built at the edge servers;
  ◦ Sub-Class 3.2: the cubes built at the edge servers present shared dimension members; consequently, the result cube requires operating specific aggregate functions over the measures, respectively sum over sum measures, sum over count measures, max over max measures, and so forth.

D. Performance Analysis

Hereafter, we provide experimental results and an analysis of our framework OLAP* over our proposed benchmark TPC-H*d, in comparison to the well-known ROLAP server Mondrian [8]. Mondrian is an open-source ROLAP server of the Pentaho BI suite, written in Java. It executes queries written in the MDX language by reading data from a relational database (RDBMS), and presents the results in a multidimensional format (a.k.a. pivot table) via a Java API [19].

In our experimental framework, the client sends a stream of MDX queries in a random order to the database tier, and measures the performance of the MDX queries for two different workloads.

TABLE I. LISTING OF HIGH-CARDINALITY DIMENSIONS OF THE TPC-H*d BENCHMARK.
Dimension | Size | OLAP Cubes
PART: p_partkey | SF × 200,000 | C2, C11, C21
CUSTOMER: c_custkey | SF × 150,000 | C10
SUPPLIER: s_suppkey | SF × 10,000 | C15, C20
CUSTOMER: c_name / ORDERS: o_orderkey | SF × 150,000 / SF × 1,500,000 | C18
SUPPLIER: s_name | SF × 10,000 | C15, C21

TABLE III. TPC-H*d BUSINESS QUERIES AND INVOKED BIG TABLES.
Business Question(s) | Big Tables
Q1, Q6, Q15 | LINEITEM
Q4, Q12, Q22 | LINEITEM, ORDERS
Q3, Q4, Q10, Q18 | LINEITEM, ORDERS, CUSTOMER
Q5, Q7 | LINEITEM, ORDERS, CUSTOMER, SUPPLIER
Q8 | LINEITEM, ORDERS, CUSTOMER, SUPPLIER, PART
Q9 | LINEITEM, ORDERS, SUPPLIER, PART, PARTSUPP
Q21 | LINEITEM, ORDERS, SUPPLIER
Q14, Q17, Q19 | LINEITEM, PART
Q20 | LINEITEM, PART, SUPPLIER
Q11 | PARTSUPP, SUPPLIER
Q2, Q16 | PARTSUPP, SUPPLIER, PART

TABLE IV. TPC-H*d BUSINESS QUERY RECOMMENDATIONS.
Table | Alternatives of Fragmentation Schemas
LINEITEM | • Any fragmentation schema: Q1, Q6, Q15
         | • DHP along ORDERS: Q2, Q3, Q4, Q5, Q7, Q8, Q9, Q10, Q12, Q16, Q18, Q21, Q22
         | • DHP along SUPPLIER: Q5, Q7, Q8, Q9, Q20, Q21
         | • DHP along PART: Q8, Q9, Q14, Q17, Q19, Q20
ORDERS | • DHP along CUSTOMER: Q8, Q9, Q14, Q17, Q19
PARTSUPP | • DHP along SUPPLIER: Q2, Q11, Q16
         | • DHP along PART: Q2, Q16

The first workload stream is a Query Workload, composed of TPC-H queries translated into MDX; the second is a Cube-then-Query Workload, composed of TPC-H*d cube MDX statements followed by query MDX statements (denoted Ci-Qi, Cj-Qj, ...). The second workload type should allow query results to be retrieved from the built cubes and, consequently, is expected to lead to better performance results. Table V shows detailed performance results for scale factor SF=10, comparing the TPC-H*d workload performances of a single DB back-end against a cluster composed of 4 DB back-ends. We also report performance results for N=4 with usage of derived data, namely:

• Aggregate tables for business questions Q1, Q3-Q8, Q12-Q20, and Q22. The corresponding OLAP cubes' sizes are scale-factor independent or very sparse (e.g., C15 and C18).

• Derived attributes for business questions Q2, Q9-Q11 and Q21. The corresponding OLAP cubes' sizes are scale-factor dependent. For these business questions, aggregate tables tend to be very big.

The hardware system configuration used for the performance measurements consists of Suno nodes located at the Sophia site of GRID5000. Each node has 32 GB of memory and Intel Xeon E5520 CPUs at 2.27 GHz, with 2 CPUs per node and 4 cores per CPU, and runs the Squeeze-x64-xen-1.3 operating system. Response times are measured over three runs, and the variance is negligible.

Experiments show that:

• For some queries, such as Q2, cube building does not improve performance. The corresponding MDX statements include new member calculus (i.e., measures or named sets), or perform filtering on levels' properties. This constrains the system to build a new pivot table for the query.

• For SF=10, most cubes allow fast data retrieval after their deployment. Nevertheless, the system under test was unable to build the cubes related to business questions Q3, Q9, Q10, Q13, Q18 and Q20, either due to memory leaks or due to system constraints. Overall, for SF=10, improvements vary from 42.78% to 100% for Q1, Q4-Q8, Q12-Q14, Q16-Q17, Q19 and Q21.

• OLAP* demonstrates good performance for N=4. Indeed, except for Q3 and Q9, for which the system was unable to run the MDX statements due to memory leaks, the rest of the queries were improved through parallel cube building. The experimental results clearly confirm the benefits deriving from our proposed framework.

TABLE V. PERFORMANCE RESULTS OF OLAP* (MySQL/Mondrian) WITH THE TPC-H*d BENCHMARK FOR SF = 10: SINGLE DB BACK-END VS. 4 MySQL DB BACK-ENDS. For each configuration (MDX Workload = single back-end; Parallel MDX Workload = 4 back-ends; Parallel+ MDX Workload = 4 back-ends with derived data), QW is the Query Workload time, and Cube/Query are the Cube-then-Query Workload times; all values are in seconds.

Query | MDX: QW | MDX: Cube | MDX: Query | Parallel: QW | Parallel: Cube | Parallel: Query | Parallel+: QW | Parallel+: Cube | Parallel+: Query
Q1 | 2,147.33 | 2,778.49 | 0.29 | 485.73 | 862.77 | 0.19 | 1.10 | 1.32 | 0.25
Q2 | 1,598.54 | 2,346.92 | 1,565.51 | 1,720.2 | 985.07 | 1,896.03 | n/a*1 | n/a*1 | -
Q3 | n/a*1 | n/a*1 | - | n/a*2 | n/a*2 | - | n/a*2 | 2,106.23 | n/a*2
Q4 | 1,657.60 | 7,956.45 | 5.33 | 523.67 | 1,657 | 1.54 | 0.06 | 0.07 | 0.05
Q5 | 54.53 | 3,200.64 | 0.46 | 12.96 | 1,219.19 | 0.19 | 0.12 | 0.99 | 0.06
Q6 | 282.11 | 371.80 | 0.53 | 72.58 | 131.70 | 0.37 | 0.42 | 0.77 | 0.37
Q7 | 260.23 | 617.20 | 0.06 | 36.01 | 195.24 | 0.06 | 0.08 | 0.95 | 0.06
Q8 | 50.63 | 2,071.00 | 4.61 | 13.38 | 716.10 | 2.70 | 0.07 | 3.83 | 0.23
Q9 | n/a*1 | n/a*1 | - | n/a*1 | n/a*1 | - | n/a*2 | n/a*2 | -
Q10 | 7,100.24 | n/a*2 | - | 2,654.20 | 13,674.02 | 1,599.47 | 127.67 | 9,545.68 | 5.16
Q11 | 2,558.21 | 3,020.27 | 1,604.10 | 535.75 | 990.75 | 505.2 | 587.99 | 875.33 | 497.67
Q12 | 456.81 | 735.67 | 123.43 | 223.6 | 467.9 | 45.7 | 0.06 | 0.13 | 0.06
Q13 | n/a*2 | n/a*2 | - | n/a*2 | n/a*2 | - | 0.08 | 0.16 | 0.05
Q14 | 391.06 | 946.16 | 0.06 | 112.41 | 356.8 | 0.05 | 0.06 | 0.13 | 0.05
Q15 | 13,005.27 | 32,064.90 | 12,413.74 | 2,870.56 | 7,832.22 | 1,945.7 | 0.05 | 0.45 | 0.03
Q16 | 414.82 | 461.90 | 4.62 | 640.59 | 615.77 | 9.27 | 3.15 | 5.25 | 0.71
Q17 | 1,131.37 | 5,711.14 | 2.03 | 279.56 | 1,150.86 | 2.05 | 0.10 | 0.12 | 0.05
Q18 | n/a*2 | n/a*1 | - | 12,331.92 | 13,111.99 | 8,272.76 | 0.02 | 0.05 | 0.02
Q19 | 598.9 | 727.72 | 37.57 | 296.18 | 330.07 | 15.57 | 4.89 | 6.78 | 0.45
Q20 | 14,662.53 | n/a*3 | - | 11,842.90 | n/a*5 | - | 2,909.71 | n/a*6 | -
Q21 | 578.09 | 855.46 | 0.15 | 185.10 | 272.39 | 0.21 | 2.04 | 25.12 | 0.71
Q22 | 68.74 | 402.16 | 39.33 | 8.19 | 98.71 | 13.67 | 6.7 | 60.4 | 3.67

• Response times of the business questions of both workloads for which aggregate tables were built, namely Q1, Q3-Q8, Q12-Q20, and Q22, were improved. Indeed, most cubes are built in a fraction of a second, especially those whose corresponding aggregate tables are small (refer to Table VII for the aggregate tables' sizes).

• The impact of derived attributes is mitigated. Performance results show good improvements for Q10 and Q21, and a small impact on Q11. For Q2, the system under test was unable to build the cube using the ps_isminimum derived attribute.

• The calculus of derived data, namely the aggregate tables reported in Table VII and the derived attributes reported in Table VI, is improved, except for non-fragmented tables. An illustrative sketch of one derived attribute follows Table VI below.

TABLE VI. DERIVED ATTRIBUTE CALCULUS FOR TPC-H*d WITH SF = 10: SINGLE DB BACK-END VS. 4 MySQL DB BACK-ENDS.
Derived Attribute | Single DB (sec) | OLAP*, N = 4 (sec)
ps_isminimum | 862.40 | 862.40
l_profit | 4,377.51 | 1,288.31
o_sumlostrevenue | 1,027.98 | 217.71
n_stockval | 20.22 | 19.88
p_sumqty, p_countlines | 1,139.94 | 331.01
ps_excess_YYYY | 18,195.48 | 1,461.99
s_nbrwaitingorders | 299.15 | 71.24
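The paper does not give the SQL behind these derived attributes; as one plausible illustration (our assumption about the semantics of ps_isminimum, namely a flag marking the minimum supply cost per part, which is the correlated condition of TPC-H Q2), it could be materialized as follows in MySQL:

-- Hypothetical materialization of the ps_isminimum derived attribute:
-- flag each PARTSUPP row whose supply cost is the minimum for its part.
ALTER TABLE PARTSUPP ADD COLUMN ps_isminimum TINYINT(1) DEFAULT 0;

UPDATE PARTSUPP ps
JOIN (SELECT ps_partkey, MIN(ps_supplycost) AS min_cost
      FROM PARTSUPP
      GROUP BY ps_partkey) AS m
  ON m.ps_partkey = ps.ps_partkey
SET ps.ps_isminimum = (ps.ps_supplycost = m.min_cost);

Once materialized this way, the correlated subquery of Q2 reduces to a simple filter on ps_isminimum, which is why such an attribute can pay off despite its one-off computation cost.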

III. APPLICATION SCENARIO: VIRTUAL OLAP DATA CUBE DESIGN

Having multiple small cubes results in faster query performance than having one big cube. Nevertheless, it induces additional storage and CPU cost if the workload is run against OLAP cubes having the same fact table and shared dimensions. A virtual cube represents a subset of a physical cube. Virtual cubes are recommended for minimizing the maintenance cost of OLAP cubes: they allow finding out shared and relevant materialized pre-computed multidimensional cubes. The pairwise comparison of N OLAP cubes results in N × (N − 1)/2 comparisons. In order to automate OLAP cube comparisons, we implemented AutoMDB [20]. AutoMDB parses an XML description of the TPC-H*d OLAP cubes, then builds matrices which show the similarities and the differences for each pair of OLAP cubes. Similarities and differences are based on comparing fact tables and counting (i) the number of shared dimensions, (ii) the number of different dimensions, (iii) the number of possibly coalesce-able dimensions, (iv) the number of shared measures, (v) the number of different measures, and (vi) the number of possibly derivable measures.

Fig. 1. Example of Merge of OLAP Cubes

Fig. 2. Definition of Virtual OLAP Cubes

TABLE VII. AGGREGATE TABLE BUILDING TIMES FOR TPC-H*d WITH SF = 10: SINGLE DB BACK-END VS. 4 MySQL DB BACK-ENDS.
Aggregate Table | Nbr of Rows | Data Volume | Single DB (sec) | OLAP*, N = 4 (sec)
agg_C1 | 129 | 16.62 KB | 343.91 | 71.63
agg_C3 | 2,210,908 | 103.32 MB | 173.45 | 39.52
agg_C4 | 135 | 5.22 KB | 138.45 | 32.92
agg_C5 | 4,375 | 586.33 KB | 822.29 | 198.6
agg_C6 | 1,680 | 84.67 KB | 148.29 | 42.67
agg_C7 | 4,375 | 372.70 KB | 720.26 | 187.8
agg_C8 | 131,250 | 12.77 MB | 2,894.38 | 818.82
agg_C12 | 49 | 3.15 KB | 186.68 | 43.94
agg_C13 | 721 | 26.33 KB | 9,819.46 | 2,272.45
agg_C14 | 84 | 6.33 KB | 367.88 | 146.76
agg_C15 | 28 | 3.84 KB | 10,904.00 | 852.84
agg_C16 | 187,495 | 10.03 MB | 63.05 | 62.26
agg_C17 | 1,000 | 45.92 KB | 3,180.26 | 435.52
agg_C18 | 624 | 37.56 KB | 905.16 | 212.32
agg_C19 | 854,209 | 80.65 MB | 88.57 | 26.10
agg_C22 | 25 | 1.73 KB | 6.25 | 1.15

A. Virtual Cube Example

Virtual cubes are recommended for minimizing the maintenance cost of OLAP cubes: they allow finding out shared and relevant materialized pre-computed multidimensional cubes. We implemented AutoMDB for recommending merges of OLAP cubes based on maximum shared properties and minimum different properties [20]. For instance, AutoMDB detects that (i) OLAP cubes C5 and C7 have the same fact table, LINEITEM; (ii) both cubes calculate the same measure, sum(l_extendedprice × (1 − l_discount)); and (iii) two dimensions of OLAP cube C7 can be collapsed within dimensions of OLAP cube C5. Figure 1 illustrates the dimension sets of both OLAP cubes C5 and C7, as well as the cube C_5_7 resulting from the merge of C5 and C7.

Notice that the sizes of OLAP cubes C5 and C7 are both equal to 4,375 (25 × 25 × 7): for C5, the number of customer nations multiplied by the number of supplier nations multiplied by the number of order date years; for C7, the number of customer nations multiplied by the number of supplier nations multiplied by the number of lineitem ship date years. The size of cube C_5_7, however, is equal to 30,625 (25 × 25 × 7 × 7), which is the number of customer nations multiplied by the number of supplier nations multiplied by the number of order date years multiplied by the number of lineitem ship date years. The size of cube C_5_7 is thus 3.5 times the combined size of the two OLAP cubes, as the following arithmetic makes explicit.
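In symbols, writing |C| for the number of cells of cube C (a restatement of the figures above):

|C_5| = |C_7| = 25 \times 25 \times 7 = 4375, \qquad
|C_{5\_7}| = 25 \times 25 \times 7 \times 7 = 30625, \qquad
\frac{|C_{5\_7}|}{|C_5| + |C_7|} = \frac{30625}{2 \times 4375} = 3.5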

Fig. 3. MDX Statements for building OLAP Cubes C5 and C7 using their respective Virtual Cubes

B. Performance Analysis

Table VIII reports the performance measurements conducted for evaluating the cost of merging OLAP cubes and of running the MDX statements related to the virtual cubes VC5 and VC7; the latter are illustrated in Figure 3. The measurements show that the time for building C_5_7 is higher than the times related to C5 and C7. Nevertheless, building the virtual cubes following the physical cube allows a gain in performance whatever the order (i.e., C5 then C7, or the inverse).

IV. CONCLUSIONS AND FUTURE WORK

Following our previous research results [1], in this paper we have provided two authoritative application scenarios that build on top of OLAP*, a middleware for parallel processing of OLAP queries that truly realizes effective and efficient OLAP over Big Data. We have provided two authoritative case studies, namely parallel OLAP data cube processing and virtual OLAP data cube design, for which we have also provided a comprehensive performance evaluation and analysis.

Future work is mainly devoted to two important aspects: (i) integrating novel data cube compression approaches (e.g., [21]) in order to speed up efficiency; and (ii) further stressing the fragmentation phase by integrating emerging intelligent fragmentation techniques, including those developed in related scientific areas (e.g., [22]).

TABLE VIII. VIRTUAL OLAP CUBES (VC5 AND VC7) BUILDING PERFORMANCES AFTER OLAP CUBE C_5_7 BUILDING.
Cube | Initial Schema (sec) | Virtual Cubes (sec)
C5 | 3,200.64 | 0.7
C7 | 617.20 | 0.2
C_5_7 | - | 3,457.7

REFERENCES

[1] A. Cuzzocrea and R. Moussa, "A cloud-based framework for supporting effective and efficient OLAP in big data environments," in 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Chicago, IL, USA, May 26-29, 2014, pp. 680–684.
[2] D. Agrawal, S. Das, and A. El Abbadi, "Big data and cloud computing: current state and future opportunities," in Proceedings of the 14th International Conference on Extending Database Technology, ser. EDBT/ICDT'11. ACM, 2011, pp. 530–533.
[3] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using hadoop," in ICDE, 2010, pp. 996–1005.
[4] A. Cuzzocrea, "Retrieving accurate estimates to olap queries over uncertain and imprecise multidimensional data streams," in SSDBM, 2011, pp. 575–576.
[5] A. Cuzzocrea, L. Bellatreche, and I. Song, "Data warehousing and OLAP over big data: current challenges and future research directions," in Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, DOLAP 2013, San Francisco, CA, USA, October 28, 2013, pp. 67–70.
[6] A. Cuzzocrea, "Data warehousing and olap over big data: A survey of the state-of-the-art, open problems and future challenges," International Journal of Business Process Integration and Management, to appear, 2015.
[7] Transaction Processing Council, "TPC-H benchmark," http://www.tpc.org/tpch, 2013.
[8] Pentaho, "Mondrian ROLAP Server," http://mondrian.pentaho.org/, 2013.
[9] A. Cuzzocrea, D. Saccà, and J. D. Ullman, "Big data: a research agenda," in 17th International Database Engineering & Applications Symposium, IDEAS '13, Barcelona, Spain, October 9-11, 2013, pp. 198–203.
[10] D. J. DeWitt, S. Madden, and M. Stonebraker, "How to build a high-performance data warehouse," http://db.lcs.mit.edu/madden/high_perf.pdf, 2005.
[11] F. Akal, K. Böhm, and H.-J. Schek, "Olap query evaluation in a database cluster: A performance study on intra-query parallelism," in Proceedings of the 6th East European Conference on Advances in Databases and Information Systems, ser. ADBIS'02, 2002, pp. 218–231.
[12] A. A. B. Lima, M. Mattoso, and P. Valduriez, "Adaptive virtual partitioning for olap query processing in a database cluster," JIDM, vol. 1, no. 1, pp. 75–88, 2010.
[13] U. Röhm, K. Böhm, and H.-J. Schek, "Olap query routing and physical design in a database cluster," in EDBT, 2000, pp. 254–268.
[14] L. Bellatreche and K. Y. Woameno, "Dimension table driven approach to referential partition relational data warehouses," in Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP, ser. DOLAP'09. ACM, 2009, pp. 9–16.
[15] M. Steinbrunn, G. Moerkotte, and A. Kemper, "Heuristic and randomized optimization for the join ordering problem," The VLDB Journal, vol. 6, no. 3, pp. 191–208, Aug. 1997.
[16] T. Stöhr, H. Martens, and E. Rahm, "Multi-dimensional database allocation for parallel data warehouses," in Proceedings of the 26th International Conference on Very Large Data Bases, ser. VLDB'00, 2000, pp. 273–284.
[17] T. Stöhr and E. Rahm, "Warlock: A data allocation tool for parallel warehouses," in Proceedings of the 27th International Conference on Very Large Data Bases, ser. VLDB'01, 2001, pp. 721–722.
[18] A. A. Lima, C. Furtado, P. Valduriez, and M. Mattoso, "Parallel olap query processing in database clusters with data replication," Distributed and Parallel Databases, vol. 25, 2009, pp. 97–123.
[19] JPivot, "JSP-based OLAP client," http://jpivot.sourceforge.net/, 2013.
[20] TPC-H*d, "Multidimensional TPC-H benchmark," https://sites.google.com/site/rimmoussa/auto_multidimensional_dbs, 2013.
[21] A. Cuzzocrea, D. Saccà, and P. Serafino, "A hierarchy-driven compression technique for advanced OLAP visualization of multidimensional data cubes," in Data Warehousing and Knowledge Discovery, 8th International Conference, DaWaK 2006, Krakow, Poland, September 4-8, 2006, pp. 106–119.
[22] A. Bonifati and A. Cuzzocrea, "Storing and retrieving xpath fragments in structured P2P networks," Data Knowl. Eng., vol. 59, no. 2, pp. 247–269, 2006.
