OLAP Query Processing in Grids ∗

Nelson Kotowski¹, Alexandre A. B. Lima³, Esther Pacitti², Patrick Valduriez², Marta Mattoso¹

¹ COPPE/UFRJ, Rio de Janeiro, Brazil
² Atlas Group, INRIA and LINA, University of Nantes, France
³ UNIGRANRIO, Rio de Janeiro, Brazil
kotowski@cos.ufrj.br, Esther.Pacitti@univ-nantes.fr, Patrick.Valduriez@inria.fr,
abento@unigranrio.com.br, marta@cos.ufrj.br

Abstract. OLAP query processing is critical for enterprise grids. Capitalizing on
our experience with the ParGRES database cluster, we propose a middleware
solution, GParGRES, which exploits database replication and inter- and intra-query
parallelism to efficiently support OLAP queries in a grid. GParGRES has been
partially implemented as database grid services on Grid5000. We give preliminary
experimental results obtained with two clusters of Grid5000 using queries of the
TPC-H Benchmark. The results show linear or almost linear speedup in query
execution, as more nodes are added in all tested configurations.

1. Introduction

Initially developed for the scientific community, Grid computing is now gaining much
interest in other areas such as enterprise information systems, thus making Grid data
management critical [19]. For instance, IBM, Oracle and Microsoft are all promoting tools
and services for enterprise grids. Data management in grids has been initially achieved
using distributed file systems. However, more general database solutions are needed to
enable the virtualization of distributed, autonomous databases using Web services and
provide transparent support for database queries [1], [3], [4], [22], [23]. Ideally, a grid
database solution must respect database autonomy (i.e. avoid database or application
migration) while taking advantage of distributed and parallel computing. This can be
achieved through the development of a middleware layer between the user applications
and the databases. Such a middleware should provide for distributed and parallel query
processing with non-intrusive techniques, considering DBMS as black-box components so
there is no need for database or application migration.


∗ Work partially funded by CAPES-COFECUB (DAAD project), CNPq-INRIA (GriData project),
French ANR Massive Data (Respire project) and the European Strep Grid4All project.

An important kind of database query is OLAP, which tends to access massive amounts
of data and is thus time-consuming. The typical solution for efficient OLAP query
processing is to exploit inter- and intra-query parallelism using a parallel database system
on a multiprocessor system or a cluster of PCs (e.g. Oracle’s Real Application Clusters).
However, this solution requires heavy migration of the existing databases and applications
to the parallel database system. A cost-effective alternative which respects database
autonomy is database clusters [8], [17], [21]. By replicating (parts of) the database on
cluster (PC) nodes, each node running a black-box DBMS, a database cluster can provide
inter- and intra-query processing in a non-intrusive way through a middleware layer. For
instance, our database cluster middleware ParGRES [17] provides transparent inter- and
intra-query processing to applications accessing any DBMS that supports SQL-99 and has
a JDBC driver for client connections.
In this paper, we propose a middleware solution to OLAP query processing in grids,
called GParGRES, which capitalizes on ParGRES to provide transparent inter- and intra-
query processing. We consider a typical grid environment with multiple clusters at
different sites. Thus, database replication and parallel query processing must be addressed
at two levels: grid level and cluster level. Compared to the database cluster approach
where the database is replicated at a single site, GParGRES enables the database to be
replicated at multiple sites of the grid, thus increasing data availability and quality of
service. For instance, if one grid site is unavailable (e.g. for maintenance), it is still
possible to run OLAP queries using other sites.
GParGRES has been partially implemented as grid services on Grid5000 [12], a large
and flexible configurable grid platform in France. We give preliminary experimental
results obtained with two clusters of Grid5000 using queries of the TPC-H Benchmark
[24]. The results show linear or almost linear speedup in query execution, as more nodes
are added in all tested configurations.
The paper is organized as follows. Section 2 introduces ParGRES. Section 3 presents
GParGRES. Section 4 gives experimental results. Section 5 presents related work. Section
6 concludes.

2. ParGRES

ParGRES is a database cluster middleware which exploits inter- and intra-query
parallelism during query processing. The only requirement for the DBMS is that it
supports SQL-99. Parallelism is obtained through full database replication and Adaptive
Virtual Partitioning (AVP) [14]. ParGRES eases database migration from centralized
environments since no new physical database design is required. It provides flexibility
with respect to node allocation for query processing: any query can be processed by any
set of cluster nodes. AVP provides for dynamic load balancing among cluster nodes
during query processing in a non-intrusive way.
Similar to other database clusters, ParGRES manages the parallel execution of queries
using DBMS instances at cluster nodes. However, database clusters like C-JDBC [8] and
PowerDB [21] use a centralized layer executed at a single node which acts as coordinator.
To avoid such centralized bottleneck, ParGRES performs decentralized control and its
components are distributed among cluster nodes (see Figure 1) [15].
There are global and local components. Global components (i.e. Mediator and Cluster
Query Processor (CQP)) execute tasks that involve several cluster nodes. Local
components (i.e. Node Query Processor (NQP) and DBMS) execute tasks in one node.
CQP coordinates all other components in the context of a query. The Mediator is
responsible for receiving requests from the applications, passing them to the CQP, and
passing back CQP responses to the applications. Since most clusters have a single node
accessible to external applications (the entry node), the Mediator component is typically
allocated at this node. Thus, only the Mediator component is centralized which gives it
full flexibility in the physical allocation of CQP for each request, thus improving the
overall environment availability. NQP locally coordinates query execution at the DBMS
and helps CQP during load balancing.
ParGRES executes four types of tasks:
(i) SQL query parsing. CQP contains a syntactic analyzer to parse SQL commands
from the client application. It uses a context-free grammar for SQL-99. Commands not
parsed by this grammar are sent directly to the DBMS. The information generated
includes: (a) a set of relations and attributes referenced by the query that may be used to
obtain intra-query parallelism; (b) information needed to perform result composition; (c)
a set of attributes used in aggregation operations.
(ii) Query processing with inter/intra-query parallelism. CQP is responsible for
choosing the type of parallelism used during query processing and allocating nodes that
will be used for it. It uses information from the Catalog, which stores the metadata needed
to implement AVP only. No specific information from the DBMS is needed, thus
preserving database autonomy and considering each DBMS as a “black-box” component.
Inter-query parallelism is relatively straightforward. CQP sends the query to the NQP
of the node with the smallest number of pending tasks. The intra-query parallel strategy
decomposes complex queries into sub-queries that will be executed in parallel over
different data fragments. CQP rewrites the original query into sub-queries. Those sub-
queries are a version of the original query containing a predicate that determines the
ranges of the virtual partitions, required by AVP. They are sent to NQPs, which
adaptively fine tune the virtual partitions locally. Each NQP executes its subquery and
generates a partial result, which is sent to the CQP. After receiving partial results from all
NQPs, CQP finishes the result composition and sends it to the client application.
Non-uniform data distribution can lead to load skew. ParGRES’s non-intrusive
dynamic load balancing technique addresses this issue. NQPs perform balancing by
exchanging messages among themselves to redefine virtual partitions. The results in [15]
show the technique is very efficient, especially for cases of extreme skew.
(iii) Result composition. ParGRES does result composition in a two-phase
aggregation. It uses parallel processing in this composition, thus minimizing
communication between nodes. In the first phase, the nodes aggregate the groups returned
by the local sub-queries. In the second phase, the groups are distributed to their respective
nodes through a hash function. Finally, each node sends its subset of the global result to
the coordinator node, which executes their union. Sort operations are also done in parallel
in a similar fashion.
(iv) Update processing. Although ParGRES focuses on read-only query processing,
typical of OLAP, updates may also be performed. Since updates in OLAP environments
are usually fast and executed at predefined times, ParGRES adopts a strong consistency
policy: it does not allow the concurrent execution of updates and queries. To implement
this policy, ParGRES has a scheduler that orders queries and updates. While updates are
processed, all the remaining queries coming from the application are blocked. When there
are just read-only queries, CQP allows them to execute in parallel.
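
The intra-query strategy and the result composition described above can be illustrated with a small sketch. It is not ParGRES code: it uses Python's built-in sqlite3 module in place of PostgreSQL, runs the sub-queries one after another rather than in parallel, and splits the key interval into fixed equal ranges, without the adaptive resizing that AVP performs. The table layout and the helper names (make_replica, key_ranges, parallel_avg) are assumptions made for the example. It shows why a query computing an average is rewritten into range-restricted sub-queries returning SUM and COUNT: partial averages cannot be combined directly, but partial sums and counts can.

```python
import sqlite3


def make_replica(rows):
    """Build one in-memory replica of a toy fact table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE lineitem (l_orderkey INTEGER, l_quantity REAL)")
    db.executemany("INSERT INTO lineitem VALUES (?, ?)", rows)
    return db


def key_ranges(lo, hi, n):
    """Split the half-open key interval [lo, hi) into n contiguous ranges."""
    step, extra = divmod(hi - lo, n)
    ranges, start = [], lo
    for i in range(n):
        end = start + step + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


# Sub-query sent to each node: the original aggregate, rewritten as SUM and
# COUNT, plus the range predicate that defines the node's virtual partition.
SUBQUERY = (
    "SELECT SUM(l_quantity), COUNT(*) FROM lineitem "
    "WHERE l_orderkey >= ? AND l_orderkey < ?"
)


def parallel_avg(replicas, lo, hi):
    """Run one range sub-query per replica and compose the global average."""
    total, count = 0.0, 0
    for db, (start, end) in zip(replicas, key_ranges(lo, hi, len(replicas))):
        part_sum, part_count = db.execute(SUBQUERY, (start, end)).fetchone()
        total += part_sum or 0.0  # SUM is NULL when the range holds no rows
        count += part_count
    return total / count if count else None
```

Every replica holds the full table, so any node can serve any range; only the predicate changes from one sub-query to the next, which is what makes the partitioning virtual.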

3. GParGRES: a Database Grid Middleware

GParGRES is a middleware to transparently access distributed databases in a grid. As its
name suggests, it is based on ParGRES and shares the same objective of efficient support
of OLAP through inter/intra-query parallelism. In a grid, the databases are managed by
DBMS that are orchestrated by ParGRES instances. GParGRES is a layer on top of
ParGRES instances. Our approach has two levels of query splitting: grid-level splitting,
implemented by GParGRES, and node-level splitting, implemented by ParGRES. We
assume that each grid node is a PC cluster.
GParGRES is designed as a wrapper that enables the use of ParGRES in a grid (in our
case, Grid5000). Figure 1 shows GParGRES’s architecture. Its main components are
described as follows. We are working on using established standards for the development
of grid applications, e.g. [18], for the implementation of GParGRES. We discuss some of
these issues after presenting our architecture.
Registry Service (RS) – concentrates information concerning GParGRES services,
such as the state of each FS and DQS instance, and ParGRES execution in the nodes.
Factory Service (FS) – responsible for creating new instances of DQS. When a client
application intends to submit queries to GParGRES, it initially asks FS to create a new
DQS instance. Each new instance receives a unique service identifier that associates it
with its respective factory. Such an identifier is not reused for other new instances even
when the service is finished. According to [10], this behavior is one of the basic
characteristics of grid services.
Distributed Query Service (DQS) – this service directly interacts with client
applications. DQS receives queries and splits them into subqueries to implement intra-
query parallelism using an approach similar to ParGRES. It takes advantage of database
replication to perform virtual partitioning. Such partitioning generates adaptive virtual
partitions (AVP) to be processed in parallel, similar to ParGRES’s CQP. This service also
performs final result composition.
Grid Local Query Service (GLQS) – local component responsible for receiving
subqueries from DQS and passing them to the local ParGRES. This service monitors
subquery execution on ParGRES to allow for query redistribution if the node is too busy
or redirect the subquery to another node in case of failure.
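
The identifier rule stated for the Factory Service can be sketched as follows. This is a minimal illustration, not the GParGRES implementation: the class names, the identifier format and the destroy method are hypothetical, and the real services are grid services rather than in-process objects. The point it shows is that an identifier handed to a DQS instance is never given to a later instance, even after the earlier one has finished.

```python
import itertools


class DistributedQueryService:
    """Placeholder for a DQS instance; only its identifier matters here."""

    def __init__(self, service_id):
        self.service_id = service_id


class FactoryService:
    """Creates DQS instances whose identifiers are unique and never reused."""

    def __init__(self, factory_name):
        self._name = factory_name
        self._next = itertools.count(1)  # monotonically increasing, never reset
        self._live = {}

    def create_dqs(self):
        # The factory name is embedded so the id ties the instance to its factory.
        service_id = f"{self._name}/dqs-{next(self._next)}"
        instance = DistributedQueryService(service_id)
        self._live[service_id] = instance
        return instance

    def destroy(self, service_id):
        # Removing the instance does not return its id to any pool.
        self._live.pop(service_id, None)

    def live_ids(self):
        return sorted(self._live)
```

A client would call create_dqs once before submitting queries and keep the returned identifier for the lifetime of its session.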

Figure 1 – GParGRES architecture with ParGRES in detail


ParGRES performs node-level load balancing while GParGRES performs grid-level
load balancing, using information about the subqueries being processed in each grid node
involved in the global query execution.
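
The two levels of splitting can be pictured with a short sketch. The text does not specify how the grid level sizes its ranges, so the choice made here, giving each cluster a share of the key interval proportional to its number of allocated nodes, is an assumption for illustration only, as are the function names. The grid level produces one range per cluster; each cluster then divides its range evenly among its nodes.

```python
def split_range(lo, hi, weights):
    """Split [lo, hi) into contiguous ranges proportional to integer weights."""
    total = sum(weights)
    bounds, acc = [lo], 0
    for w in weights:
        acc += w
        bounds.append(lo + (hi - lo) * acc // total)
    return list(zip(bounds[:-1], bounds[1:]))


def two_level_plan(lo, hi, clusters):
    """Map each cluster name to the list of key ranges for its nodes.

    clusters: dict of cluster name -> number of nodes allocated (assumed input).
    """
    names = list(clusters)
    # Grid level: one range per cluster, weighted by node count.
    grid_ranges = split_range(lo, hi, [clusters[name] for name in names])
    plan = {}
    for name, (c_lo, c_hi) in zip(names, grid_ranges):
        # Cluster level: equal share for each node of this cluster.
        plan[name] = split_range(c_lo, c_hi, [1] * clusters[name])
    return plan
```

With the hypothetical input `two_level_plan(0, 120, {"parasol": 2, "paraquad": 4})`, the first cluster receives keys 0 to 39 split over two nodes and the second receives keys 40 to 119 split over four. These static ranges are only starting points: load balancing at either level would adjust them during execution.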
Let us now discuss the implementation of GParGRES as grid services and their
compatibility with existing grid solutions, such as the Open Grid Service Architecture
(OGSA) [18], in particular, OGSA-Data Access and Integration (OGSA-DAI) and Web
Services Resource Framework (WSRF).
Registry Service (RS) – RS can be implemented as a WSRF-compliant service [27]. It is
possible to use the WSDL specification adopted by the Index Service described in MDS4
(WS MDS – Web Service Monitoring and Discovery System) as the basis for RS to
collect metadata about grid computing resources. In Grid5000, this metadata is supplied by the
Ganglia tool that also follows MDS4.
Factory Service (FS) – To create new instances of DQS, FS can be implemented with
the help of GDSF – Grid Data Service Factory - OGSA-DAI service, as they have similar
functions, i.e., they create instances of the main service of an application.
Grid Local Query Service (GLQS) – This service can take advantage of the GDS
(Grid Data Service) service specified by OGSA-DAI [2], which is able to interact with a
data resource. In our case, GDS can be used to interact with the ParGRES instance
running on a grid node. ParGRES would act as a relational DBMS to GDS.

4. Experimental Results

To validate our approach, we implemented GParGRES and did a first performance
evaluation based on experiments with Grid5000 [7], [12], a French project that creates a
large-scale reconfigurable grid infrastructure to support distributed and parallel
experiments. Today, Grid5000 has nine sites spread over France. Each site is itself a
cluster of PCs, and all sites are interconnected by a high-speed network. One main
advantage of Grid5000 is that it is possible to reconfigure each grid node for experiments.
With the Kadeploy [13] tool, it is possible to generate customized images of operating
systems and applications (e.g., DBMS), store and automatically or interactively load them
through the job scheduler tool OAR [7], [11]. This way, the time required to setup the
environment needed for experiments is reduced through the use of Grid5000 official tools.
Also, Grid5000 has a tool (Ganglia [16]) that supplies information concerning the
activities of each grid node, from CPU load to hard disk consumption.
Our preliminary experiments are performed on two clusters (located in Rennes). The
Parasol cluster has 64 nodes, each with 2 Opteron 2.2GHz CPUs, 2GB RAM and 73 GB
HD. The Paraquad cluster has 64 nodes, each with 2 Dual Core Xeon 2.33GHz CPUs,
4GB RAM and 160GB HD. The clusters are interconnected by 1 Gbps network links,
which reduces the costs of message communication.
In all experiments, we used the Kadeploy tool to generate an image of the 64-bit
Debian OS (available on Grid5000) along with the PostgreSQL 8.2.4 [20] DBMS and a
ParGRES instance for each cluster node. Jobs were interactively submitted through OAR.
The use of ParGRES assumes a computational environment of three layers: the
application layer (i.e. an OLAP tool), the ParGRES layer, and the database layer (with a
DBMS instance in each node of the cluster). Each DBMS accesses its local database, in
which the data cubes have already been generated and are ready to be queried. We assume
the OLAP client tool acquires data using SQL.
ParGRES is developed in Java. The communication between ParGRES and each
DBMS is done through JDBC, and between each internal module through RMI.
Our tests are based on the TPC-H benchmark [24] which is representative of ad-hoc
OLAP applications. We generated the database according to the TPC-H specifications
with a scale factor of 1 using the DBMS PostgreSQL 8.2.4, which gave us a database of
approximately 2.2GB (including all indexes). Clustered indexes based on the first attribute
of the primary key were generated for each fact table (Orders and LineItem). They are
necessary for AVP, implemented by ParGRES. Indexes were also generated for all other
primary and foreign keys. No other indexes were created, as required by TPC-H.
The TPC-H queries used for the preliminary tests of GParGRES are Q1, Q5, Q6, Q12,
Q14 and Q18. We restrict our analysis to these queries due to space limitations. They have
different levels of complexity and are quite representative of OLAP applications.
We performed two kinds of experiments (see Figure 2 and Figure 3). In the first
experiment, the clusters are isolated and each query is entirely processed by each one.
These experiments make it possible to evaluate the prototype performance in each cluster,
just as ParGRES. Each query is run ten times for each cluster configuration. Then, we
take the average time of the last nine runs (the first one is not considered).
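
The measurement rule (ten runs, first one discarded) amounts to the following small computation. The function name is hypothetical and the numbers in the usage note are invented purely to show the effect of dropping the cold first run; they are not measurements from the experiments.

```python
from statistics import mean


def warm_average(run_times):
    """Average execution time over all runs except the first one."""
    if len(run_times) < 2:
        raise ValueError("need at least two runs to discard the first")
    # The first run is dropped, so caches are already warm in what remains.
    return mean(run_times[1:])
```

For instance, `warm_average([10.0] + [2.0] * 9)` averages only the nine later runs, so the slow first run does not distort the reported time.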

Figure 2 - Results with Isolated Clusters, panels (a) and (b)

Figure 3 - Mixed Configuration Results


The results obtained with the Parasol and Paraquad clusters are shown in Figure 2.
When considering single-node query processing, the results in Paraquad are better
because the cluster is more powerful. However, as more nodes are added (so virtual
fragments are more likely to be found in memory), the results of the two clusters tend to
be very close. This demonstrates the effectiveness of ParGRES’s non-intrusive approach
and thus that of GParGRES.
In our second kind of experiment, both clusters process the same set of queries. For
the configuration with 1 node, we use an NQP running in Paraquad and the CQP running
in Parasol, in order not to eliminate the inter-cluster communication costs. We call this
Mixed Configuration and the experimental results are shown in Figure 3.
Both kinds of experiments show similar performance, with a slight improvement in
the second one (which still does not characterize it as the best scenario). It can thus
be inferred that message passing costs are not significant for the final performance.
Furthermore, the results obtained with GParGRES achieve linear or almost linear speedup
in query execution, as more nodes are added in all tested configurations. This is
obtained without any DBMS-, network- or machine-specific optimization. Thus, these are
very encouraging results which make GParGRES an attractive solution for OLAP support
in grids.
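
For reference, the usual way to quantify such a claim is to compare the single-node execution time with the n-node time. The sketch below uses the standard definitions of speedup and parallel efficiency; the function names are hypothetical and the figures in the usage note are illustrative, not taken from the experiments.

```python
def speedup(time_one_node, time_n_nodes):
    """Speedup of an n-node run relative to the single-node run."""
    return time_one_node / time_n_nodes


def efficiency(time_one_node, time_n_nodes, n_nodes):
    """Fraction of the ideal (linear) speedup achieved with n_nodes."""
    return speedup(time_one_node, time_n_nodes) / n_nodes
```

Speedup is linear when it equals the number of nodes, that is, when efficiency is 1.0. A query taking 80 s on one node and 10 s on eight nodes would have a speedup of 8 and an efficiency of 1.0.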

5. Related Work

OGSA-DAI [2] is a middleware based on OGSA to provide access to relational and XML
databases in grids. OGSA-DAI is complementary rather than alternative to GParGRES as
it does not provide services for query processing. It provides a standard way to send a
query to a grid data resource and obtain its corresponding result. OGSA-DAI can be used
as a basis for GParGRES implementation, as some of its services are useful to those
presented in GParGRES architecture.
An alternative for distributed query processing in grids is through OGSA-DQP [3],
which is based on OGSA-DAI. However, OGSA-DQP does not automatically provide for
intra-query parallelism at the operator level. GParGRES provides for intra-operator
parallelism as it supports data partitioning, which is important for OLAP query
processing. With intra-operator parallelism, the same operator (e.g. join operator) is
executed in parallel by GParGRES using several ParGRES instances which process
different data subsets. Our solution is thus better for OLAP queries.
Some related works propose new data models for data warehouses in grids, e.g. [26].
Using data fragmentation [5], physical fragments of the data warehouse are distributed
among grid nodes. Then, grid services are built to identify and index such fragments.
Some grid services are also proposed for generating distributed query execution plans. One
main advantage of GParGRES is to provide intra-query parallelism without requiring any
physical fragmentation of the database. In addition, GParGRES works with standard
relational DBMS while the approach presented in [26] does not make it clear if data are
stored in relational databases or in flat files.
Finally, the approach proposed in [26] uses spatial indexes based on X-Trees [6]. This
index structure is not commonly found in standard DBMS, since it requires special-
purpose implementation. GParGRES does not require any special index structure. It
requires only standard clustered ordered indexes, which can be found in many DBMS.

6. Conclusion

In this paper, we proposed GParGRES, a middleware for OLAP query processing in grids.
GParGRES capitalizes on our previous work on ParGRES, a database cluster middleware,
to provide transparent inter- and intra-query processing without compromising database
and application autonomy. Considering a typical grid environment with multiple clusters
at different sites, our approach has two levels of query splitting: grid-level splitting,
implemented by GParGRES, and node-level splitting, implemented by ParGRES.
Furthermore, GParGRES enables the database to be replicated at multiple sites of the grid,
thus increasing data availability and quality of service. GParGRES can be implemented
as grid services and is compatible with existing grid solutions, in particular OGSA.
GParGRES has been partially implemented as grid services on Grid5000. We gave
preliminary experimental results obtained with two clusters of Grid5000 using queries of
the TPC-H Benchmark. The results show linear or almost linear speedup in query
execution, as more nodes are added in all tested configurations.
These are very encouraging results which make GParGRES an attractive solution for
OLAP application support in grids. Besides more substantial performance experiments, as
were done for ParGRES, this work can be extended in several interesting directions. The
first one is to provide support for partial replication [9] (as opposed to full replication)
which is required for very large databases. Another promising direction is the support of
top-k queries [1], another important kind of query whose support in grids has not yet
received attention. In particular, extending best position algorithms [1] to work in
GParGRES is a challenging problem.

References

1. Akbarinia, R., Pacitti E., and Valduriez, P.: Best Position Algorithms for Top-k Queries. In:
VLDB, Vienna, Austria (2007)
2. Anjomshoaa, A., et al.: The Design and Implementation of Grid Database Services in OGSA-
DAI. Concurrency and Computation: Practice and Experience, 17(2-4), 357--376 (2005)
3. Alpdemir, M. N., et al.: Service-Based Distributed Querying on the Grid. In: ICSOC 2003.
LNCS, vol. 2910, pp. 467--483. Springer (2003)
4. Bell, W. H., Bosio, D., Hoschek, W., Kunszt, P., McCance, G. and Silander, M.: Project
Spitfire – Towards Grid Web Service Databases. Technical Report, GGF, (2002)
5. Bellatreche, L., Karlapalem K. and Mohania M.: OLAP Query Processing for Partitioned Data
Warehouses. In: DANTE, pp. 35--42. IEEE (1999)
6. Berchtold, S., Keim, D. A. and Kriegel, H.-P.: The X-tree: An Index Structure for High-
Dimensional Data. In: VLDB, pp. 28--39. Mumbai (1996)
7. Cappello, F., Desprez, F., Dayde, M., et al.: Grid5000: a large scale and highly reconfigurable
Grid experimental testbed. In: Int. Workshop on Grid Computing, pp. 99--106. IEEE (2005)
8. Cecchet, E., Marguerite, J., and Zwaenepoel, W.: C-JDBC: Flexible Database Clustering
Middleware. In: Freenix 2004, pp. 9--18. USENIX Association, Boston (2004)
9. Furtado, C., Lima, A., Pacitti, E., Valduriez, P. and Mattoso, M.: Physical and Virtual
Partitioning in OLAP Database Clusters. In: SBAC, pp. 143--150. Rio de Janeiro (2005)
10. Foster, I., Kesselman, C. and Tuecke, S.: The Anatomy of the Grid. Enabling Scalable Virtual
Organizations. International Journal of High Performance Computing Applications, 15(3),
200--222, (2001)
11. Georgiou, Y., Richard O., Neyron P., Huard G. and Martin C.: A batch scheduler with high
level components. In: CCGRID’2005, pp. 776--783. IEEE, Cardiff, (2005)
12. Grid5000, http://www.grid5000.fr
13. Kadeploy, http://kadeploy.imag.fr/
14. Lima, A. A. B., Mattoso, M. and Valduriez, P.: Adaptive Virtual Partitioning for OLAP Query
Processing in a Database Cluster. In: SBBD, pp. 92--105. Brasília, (2004)
15. Lima, A. A. B.: Intra-Query parallelism in database clusters. Ph.D. Thesis, COPPE/UFRJ, Rio
de Janeiro, (2004)
16. Massie, M. N., Chun, B. N. and Culler, D. E.: The Ganglia Distributed Monitoring System:
design, implementation, and experience. Parallel Computing, 30(7), 817--840, (2004)
17. Mattoso, M., Lima, A. A. B., et al: ParGRES: a middleware for executing OLAP queries in
parallel. Technical Report, http://pargres.nacad.ufrj.br/Documentos/ES-690.pdf, (2005).
18. Open Grid Services Architecture, http://www.globus.org/ogsa
19. Pacitti, E., Valduriez, P. and Mattoso, M.: Grid Data Management: open problems and new
issues. Journal of Grid Computing, 5(3) (2007).
20. PostgreSQL, http://www.postgresql.org
21. Röhm, U., Böhm, K., Schek, H.-J., et al.: FAS - A Freshness-Sensitive Coordination
Middleware for a Cluster of OLAP Components. In: VLDB, pp. 754--765. Hong Kong, (2002)
22. Santisteban, M. A. N., Gray, J., Szalay, A. S., Annis, J., Thakar, A. R. and O’Mullane, W. J.:
When Database Systems Meet the Grid. In: CIDR, pp. 154--161. California, (2004)
23. Smith, J., Gounaris, A., Watson, P., Paton, N. W., Fernandes, A. A. A. and Sakellariou, R.:
Distributed Query Processing on the Grid. In: GRID. LNCS, vol. 2536, pp. 279--290, Springer
(2002)
24. TPC-H Benchmark, http://www.tpc.org
25. Watson, P.: Databases and the Grid. Technical Report, UK e-Science (2003)
26. Wehrle, P., Miquel M. and Tchounikine, A.: A Model for Distributing and Querying a Data
Warehouse on a Computing Grid. In: ICPADS, pp. 203--209. IEEE, Fukuoka (2005)
27. Web Services Resource Framework, http://www.oasis-open.org/committees/wsrf/
