OLAP Query Processing in Grids: Kotowski@cos - Ufrj.br, Esther - Pacitti@univ-Nantes - FR, Patrick - Valduriez@inria - FR
1. Introduction
Initially developed for the scientific community, Grid computing is now gaining much
interest in other areas such as enterprise information systems, thus making Grid data
management critical [19]. For instance, IBM, Oracle and Microsoft are all promoting tools
and services for enterprise grids. Data management in grids was initially achieved
using distributed file systems. However, more general database solutions are needed to
enable the virtualization of distributed, autonomous databases using Web services and
provide transparent support for database queries [1], [3], [4], [22], [23]. Ideally, a grid
database solution must respect database autonomy (i.e. avoid database or application
migration) while taking advantage of distributed and parallel computing. This can be
achieved through the development of a middleware layer between the user applications
and the databases. Such middleware should provide distributed and parallel query
processing using non-intrusive techniques, treating each DBMS as a black-box component so
that no database or application migration is needed.
∗ Work partially funded by CAPES-COFECUB (DAAD project), CNPq-INRIA (GriData project),
French ANR Massive Data (Respire project) and the European Strep Grid4All project.
An important kind of database query is OLAP, which tends to access massive amounts
of data and is thus time-consuming. The typical solution for efficient OLAP query
processing is to exploit inter- and intra-query parallelism using a parallel database system
on a multiprocessor system or a cluster of PCs (e.g. Oracle’s Real Application Clusters).
However, this solution requires heavy migration of the existing databases and applications
to the parallel database system. A cost-effective alternative which respects database
autonomy is database clusters [8], [17], [21]. By replicating (parts of) the database on
cluster (PC) nodes, each node running a black-box DBMS, a database cluster can provide
inter- and intra-query processing in a non-intrusive way through a middleware layer. For
instance, our database cluster middleware ParGRES [17] provides transparent inter- and
intra-query processing to applications accessing any DBMS that supports SQL-99 and has
a JDBC driver for client connections.
In this paper, we propose a middleware solution to OLAP query processing in grids,
called GParGRES, which capitalizes on ParGRES to provide transparent inter- and intra-
query processing. We consider a typical grid environment with multiple clusters at
different sites. Thus, database replication and parallel query processing must be addressed
at two levels: grid level and cluster level. Compared to the database cluster approach
where the database is replicated at a single site, GParGRES enables the database to be
replicated at multiple sites of the grid, thus increasing data availability and quality of
service. For instance, if one grid site is unavailable (e.g. for maintenance), it is still
possible to run OLAP queries using other sites.
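
The two levels can be illustrated with a minimal sketch. It assumes that a query is split by dividing a key interval into contiguous sub-ranges, first among the available sites (grid level) and then among the nodes of each site (cluster level). The names `split_range` and `two_level_plan`, the equal-share policy and the site descriptions are hypothetical and are not taken from GParGRES.

```python
def split_range(lo, hi, parts):
    """Split the key interval [lo, hi] into `parts` contiguous sub-ranges."""
    size, extra = divmod(hi - lo + 1, parts)
    ranges, start = [], lo
    for i in range(parts):
        # The first `extra` sub-ranges take one additional key each.
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def two_level_plan(sites, lo, hi):
    """Build a two-level plan.

    `sites` maps a site name to (available, node_count). Unavailable sites
    are skipped, so the query can still run on the remaining replicas.
    Equal shares per site are an assumption made for this sketch.
    """
    up = {name: nodes for name, (available, nodes) in sites.items()
          if available and nodes > 0}
    if not up:
        raise RuntimeError("no grid site available")
    plan = {}
    # Grid level: one key range per available site.
    grid_ranges = split_range(lo, hi, len(up))
    for (name, nodes), (site_lo, site_hi) in zip(up.items(), grid_ranges):
        # Cluster level: the site's range is split again among its nodes.
        plan[name] = split_range(site_lo, site_hi, nodes)
    return plan


if __name__ == "__main__":
    sites = {"site_a": (True, 2), "site_b": (False, 4), "site_c": (True, 3)}
    for site, node_ranges in two_level_plan(sites, 1, 600).items():
        print(site, node_ranges)
```

In this sketch, a site that is down for maintenance simply receives no range, and the other sites cover the whole interval because each one holds a replica of the database.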
GParGRES has been partially implemented as grid services on Grid5000 [12], a large-scale,
highly reconfigurable grid platform in France. We give preliminary experimental
results obtained with two clusters of Grid5000 using queries of the TPC-H Benchmark
[24]. The results show linear or almost linear speedup in query execution, as more nodes
are added in all tested configurations.
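
For reference, speedup is understood here in the usual sense. The notation below is an assumption of this note, not taken from the experiments: T(n) denotes the response time of a query when n nodes are used.

```latex
% T(n): response time of a query executed with n nodes (assumed notation)
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}
% Linear speedup:        S(n) = n, i.e. E(n) = 1
% Almost linear speedup: E(n) slightly below 1
```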
The paper is organized as follows. Section 2 introduces ParGRES. Section 3 presents
GParGRES. Section 4 gives experimental results. Section 5 presents related work. Section
6 concludes.
2. ParGRES
Figure 2 - Results with Isolated Clusters (panels (a) and (b))
5. Related Work
OGSA-DAI [2] is a middleware based on OGSA to provide access to relational and XML
databases in grids. OGSA-DAI is complementary rather than alternative to GParGRES as
it does not provide services for query processing. It provides a standard way to send a
query to a grid data resource and obtain its corresponding result. OGSA-DAI can be used
as a basis for the GParGRES implementation, since some of its services can support those
defined in the GParGRES architecture.
An alternative for distributed query processing in grids is through OGSA-DQP [3],
which is based on OGSA-DAI. However, OGSA-DQP does not automatically provide for
intra-query parallelism at the operator level. GParGRES provides for intra-operator
parallelism as it supports data partitioning, which is important for OLAP query
processing. With intra-operator parallelism, the same operator (e.g. join operator) is
executed in parallel by GParGRES using several ParGRES instances which process
different data subsets. Our solution is thus better suited to OLAP queries than OGSA-DQP.
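
Intra-operator parallelism over replicated data can be sketched as follows. This is an illustration under stated assumptions, not the GParGRES or ParGRES implementation: each "replica" is an in-memory SQLite database standing in for a black-box DBMS, the `orders` table and its columns are hypothetical, and the same aggregate is run on every replica with a different key-range predicate before the partial results are combined.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_replica():
    """Build one full in-memory replica of a toy 'orders' table."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE orders (o_key INTEGER PRIMARY KEY, o_total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(k, float(k % 7)) for k in range(1, 1001)],
    )
    return conn


def key_ranges(lo, hi, parts):
    """Split the key interval [lo, hi] into `parts` contiguous sub-ranges."""
    size, extra = divmod(hi - lo + 1, parts)
    ranges, start = [], lo
    for i in range(parts):
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def partial_aggregate(conn, key_range):
    """Run the same aggregate on one replica, restricted to one key range."""
    return conn.execute(
        "SELECT COALESCE(SUM(o_total), 0), COUNT(*) FROM orders "
        "WHERE o_key BETWEEN ? AND ?",
        key_range,
    ).fetchone()


def parallel_aggregate(replicas, lo, hi):
    """Send one sub-query per replica and combine the partial results."""
    ranges = key_ranges(lo, hi, len(replicas))
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        partials = list(pool.map(partial_aggregate, replicas, ranges))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


if __name__ == "__main__":
    replicas = [make_replica() for _ in range(4)]
    print(parallel_aggregate(replicas, 1, 1000))
```

Because every replica holds the full table, no physical fragmentation is needed: the partitioning exists only in the predicates of the sub-queries. SUM and COUNT are combined by simple addition; an aggregate such as AVG would have to be recomputed from the combined sum and count.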
Some related works propose new data models for data warehouses in grids, e.g. [26].
Using data fragmentation [5], physical fragments of the data warehouse are distributed
among grid nodes. Then, grid services are built to identify and index such fragments.
Some grid services are also proposed for generating distributed query execution plans. One
main advantage of GParGRES is to provide intra-query parallelism without requiring any
physical fragmentation of the database. In addition, GParGRES works with standard
relational DBMSs, whereas the approach presented in [26] does not make clear whether data
are stored in relational databases or in flat files.
Finally, the approach proposed in [26] uses spatial indexes based on X-Trees. This
index structure is not commonly found in standard DBMS, since it requires special-
purpose implementation. GParGRES does not require any special index structure. It
requires only standard clustered ordered indexes, which can be found in many DBMS.
6. Conclusion
In this paper, we proposed GParGRES, a middleware for OLAP query processing in grids.
GParGRES capitalizes on our previous work on ParGRES, a database cluster middleware,
to provide transparent inter- and intra-query processing without compromising database
and application autonomy. Considering a typical grid environment with multiple clusters
at different sites, our approach has two levels of query splitting: grid-level splitting,
implemented by GParGRES, and node-level splitting, implemented by ParGRES.
Furthermore, GParGRES enables the database to be replicated at multiple sites of the grid,
thus increasing data availability and quality of service. GParGRES can be implemented
as grid services and is compatible with existing grid solutions, in particular OGSA.
GParGRES has been partially implemented as grid services on Grid5000. We gave
preliminary experimental results obtained with two clusters of Grid5000 using queries of
the TPC-H Benchmark. The results show linear or almost linear speedup in query
execution, as more nodes are added in all tested configurations.
These are very encouraging results, which make GParGRES an attractive solution for
OLAP application support in grids. Beyond more extensive performance experiments, as
were done for ParGRES, this work can be extended in several interesting directions. The
first is to support partial replication [9] (as opposed to full replication),
which is required for very large databases. Another promising direction is the support of
top-k queries [1], another important kind of query whose support in grids has not yet
received attention. In particular, extending best position algorithms [1] to work in
GParGRES is a challenging problem.
References
1. Akbarinia, R., Pacitti E., and Valduriez, P.: Best Position Algorithms for Top-k Queries. In:
VLDB, Vienna, Austria (2007)
2. Anjomshoaa, A., et al.: The Design and Implementation of Grid Database Services in OGSA-
DAI. Concurrency and Computation: Practice and Experience, 17(2-4), 357--376 (2005)
3. Alpdemir, M. N., et al.: Service-Based Distributed Querying on the Grid. In: ICSOC 2003.
LNCS, vol. 2910, pp. 467--483. Springer (2003)
4. Bell, W. H., Bosio, D., Hoschek, W., Kunszt, P., McCance, G. and Silander, M.: Project
Spitfire – Towards Grid Web Service Databases. Technical Report, GGF, (2002)
5. Bellatreche, L., Karlapalem K. and Mohania M.: OLAP Query Processing for Partitioned Data
Warehouses. In: DANTE, pp. 35--42. IEEE (1999)
6. Berchtold, S., Keim D. A. and Kriegel, H.-P.: The X-tree: An Index Structure for High-
Dimensional Data. In: VLDB, pp. 28--39. Mumbai (1996)
7. Cappello, F., Desprez, F., Dayde, M., et al.: Grid5000: a large scale and highly reconfigurable
Grid experimental testbed. In: Int. Workshop on Grid Computing, pp. 99--106. IEEE (2005)
8. Cecchet, E., Marguerite, J., and Zwaenepoel, W.: C-JDBC: Flexible Database Clustering
Middleware. In: Freenix 2004, pp. 9--18. USENIX Association, Boston (2004)
9. Furtado, C., Lima, A., Pacitti, E., Valduriez, P. and Mattoso, M.: Physical and Virtual
Partitioning in OLAP Database Clusters. In: SBAC, pp. 143--150. Rio de Janeiro (2005)
10. Foster, I., Kesselman, C. and Tuecke, S.: The Anatomy of the Grid: Enabling Scalable Virtual
Organizations. International Journal of High Performance Computing Applications, 15(3),
200--222 (2001)
11. Georgiou, Y., Richard O., Neyron P., Huard G. and Martin C.: A batch scheduler with high
level components. In: CCGRID’2005, pp. 776--783. IEEE, Cardiff, (2005)
12. Grid5000, http://www.grid5000.fr
13. Kadeploy, http://kadeploy.imag.fr/
14. Lima, A. A. B., Mattoso, M. and Valduriez, P.: Adaptive Virtual Partitioning for OLAP Query
Processing in a Database Cluster. In: SBBD, pp. 92--105. Brasília, (2004)
15. Lima, A. A. B.: Intra-Query parallelism in database clusters. Ph.D. Thesis, COPPE/UFRJ, Rio
de Janeiro, (2004)
16. Massie, M. N., Chun, B. N. and Culler, D. E.: The Ganglia Distributed Monitoring System:
design, implementation, and experience. Parallel Computing, 30(7), 817--840, (2004)
17. Mattoso, M., Lima, A. A. B., et al: ParGRES: a middleware for executing OLAP queries in
parallel. Technical Report, http://pargres.nacad.ufrj.br/Documentos/ES-690.pdf, (2005).
18. Open Grid Services Architecture, http://www.globus.org/ogsa
19. Pacitti, E., Valduriez, P. and Mattoso, M.: Grid Data Management: open problems and new
issues. Journal of Grid Computing, 5(3) (2007).
20. PostgreSQL, http://www.postgresql.org
21. Röhm, U., Böhm, K., Schek, H.-J., et al.: FAS - A Freshness-Sensitive Coordination
Middleware for a Cluster of OLAP Components. In: VLDB, pp. 754--765. Hong Kong, (2002)
22. Santisteban, M. A. N., Gray, J., Szalay, A. S., Annis, J., Thakar, A. R. and O’Mullane, W. J.:
When Database Systems Meet the Grid. In: CIDR, pp. 154--161. California, (2004)
23. Smith, J., Gounaris, A., Watson, P., Paton, N. W., Fernandes, A. A. A. and Sakellariou, R.:
Distributed Query Processing on the Grid. In: GRID. LNCS, vol. 2536, pp. 279--290, Springer
(2002)
24. TPC-H Benchmark, http://www.tpc.org
25. Watson, P.: Databases and the Grid. Technical Report, UK e-Science (2003)
26. Wehrle, P., Miquel M. and Tchounikine, A.: A Model for Distributing and Querying a Data
Warehouse on a Computing Grid. In: ICPADS, pp. 203--209. IEEE, Fukuoka (2005)
27. Web Services Resource Framework, http://www.oasis-open.org/committees/wsrf/