A High-Performance Computing Method for Data Allocation in Distributed Database Systems
DOI 10.1007/s11227-006-0001-8
© Springer Science + Business Media, LLC 2007
Abstract  Enhancing the performance of distributed database systems (DDBs) can be achieved by speeding up the computation of data allocation, which leads to faster allocation decisions, smaller data redundancy, and shorter processing time. This paper deals with an integrated method for grouping the distributed sites into clusters and customizing the allocation of database fragments to the clusters and their sites. We design a high-speed clustering and allocation method that determines which fragments are allocated to which cluster and site so as to maintain data availability and a constant systemic reliability; we evaluate the performance achieved by this method and demonstrate its efficiency by means of tabular and graphical representations. We tested our method over different network sites and found that it reduces the data transferred between the sites during execution, minimizes the communication cost needed for processing applications, and handles the database queries while meeting their future needs.
M. Ramachandran
e-mail: M.ramachandran@lmu.ac.uk
N. Bowring
Manchester Metropolitan University, Department of Engineering & Technology,
Manchester, M1 5GD, UK
e-mail: N.Bowring@mmu.ac.uk
Springer
4 I. O. Hababeh et al.
1 Introduction
Considerable progress has been made in the last few years in improving the performance of distributed database systems. The development of data allocation models in DDBs is becoming difficult due to the complexity of the huge number of sites and their communication considerations. Under such conditions, simulation of clustering and data allocation is an adequate tool for understanding and evaluating the performance of data allocation in DDBs. Clustering sites and fragment allocation are key challenges in DDBs performance,
and are considered to be efficient methods that have a major role in reducing transferred
and accessed data during the execution of the applications. Clustering is a method of grouping sites according to a certain criterion to increase the system I/O performance. The fragment allocation technique describes the way in which the database fragments are distributed among the clusters and their respective sites in DDBs; it attempts to minimize the communication costs by distributing the global database over the sites, increases availability and reliability where multiple copies of the same data are allocated, and reduces the storage overheads.
Significant efforts have been made to minimize the amount of data transferred during
processing the applications [1] and to reduce the amount of irrelevant data accessed by the
applications [2]. Several kinds of distributed approaches have been implemented that describe
the clustering and data allocation in the distributed database systems. Some approaches have
an exponential time complexity for their allocating methods [3–5]. Some methods don’t
have fragment replications over the DDBs sites, which decrease the availability and integrity
[6–8]. Other approaches are applied on specific types of network connectivity or a small
number of sites [9–11].
This paper presents a high-performance approach that computes the decision values for
fragments allocation in DDBs, and shows a logical way for grouping sites into clusters to
which fragments would be allocated. We will describe a way for grouping the sites of DDBs
into clusters and allocating fragments to the clusters and their respective sites. This approach
improves the performance of the applications execution. It does so by minimizing the access
cost of fragments, reducing the amount of data transferred during the run time, speeding
up the database system by maximizing the degree of parallel execution, and increasing data
availability and integrity by allocating multiple copies (replication) of the same fragment
over the sites where possible.
The remainder of this paper is organized as follows. Section 2 describes our clustering technique. Section 3 explains how fragments are allocated to the clusters and then to their sites. Section 4 presents the performance evaluation. Finally, Section 5 gives some concluding remarks.
2 Clustering sites
Grouping sites into clusters improves the system performance by reducing the number of
communications during the processes of fragments allocation to the DDBs clusters and their
sites. In this section, we introduce a simulation of a complex set of network communications by generating clusters based on the communication cost between the sites. We present here a clustering method that assigns all sites in the DDBs network to a small number of clusters (groups) in order to reduce the number of communications and thus improve the system performance. Figure 1 shows the possible number of communications between 6 distributed
sites over the DDBs network.
Fig. 1  The possible communications between the six distributed sites over the DDBs network
This method categorises the sites based on a clustering decision value that determines
whether or not a site can be grouped in a specific cluster. Basically, the clustering decision
value is specified through two elements: the communication cost between the current site
and the other sites in the DDBs, and the communication cost range that the site should match
to be grouped in a specific cluster. The following subsections detail the basics of our method.
2.1 Communication cost between the current site and the other sites in the DDBs
network
We define the cost of communication between each pair of sites Si and S j in the DDBs as
CC(Si, Sj). For simplicity, we assume the following communication cost rules:
- The communication cost between any two sites is symmetric: CC(Si, Sj) is equal to CC(Sj, Si);
- The communication cost within the same site is equal to zero;
- The communication cost is proportional to the distance between the sites.
Table 1 presents an example of the communication costs between the six sites in the proposed DDBs, measured in communication units.

Table 1  Communication cost between sites (in communication units)

      S1   S2   S3   S4   S5   S6
S1     0    2   10    7   10    7
S2     2    0    6    7    8   12
S3    10    6    0    2    6   11
S4     7    7    2    0    9    7
S5    10    8    6    9    0    1
S6     7   12   11    7    1    0
The first row of data represents the communication cost between site 1 and other sites in
the network, the communication cost in site 1 is considered to be zero, the communication
cost between site 1 and site 2 is equal to 2 units, and the communication costs between site
1 and the sites 3, 4, 5, 6 are equal to 10, 7, 10, and 7 units respectively. The remaining data
rows represent the communication cost between the other sites in the same manner.
From our point of view, we define “CCR” as the number of communication units which
are allowed for the maximum difference of communication cost between any two sites to be
grouped in the same cluster. This number is determined by the DDBs network administrators.
We define the clustering decision value “CDV” as a logical value to determine if each pair of
sites Si and S j in the DDBs network can be grouped in one cluster or not. CDV is computed
as a result of the comparison between the cost of communication CC(Si , S j ) between the
sites Si and S j and the communication cost range CCR. We describe CDV in the following
formula.
cdv(Si, Sj) = 1;  CC(Si, Sj) ≤ CCR ∧ i ≠ j
cdv(Si, Sj) = 0;  CC(Si, Sj) > CCR ∨ i = j
Accordingly, if cdv(Si , S j ) is equal to 1, then the sites Si and S j are assigned to one cluster,
otherwise they are assigned to different clusters.
In order to set up an efficient clustering method, we define the following clustering algo-
rithm based on the basics described above.
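As a concrete illustration, the CDV-based grouping can be sketched in Python. This is a minimal sketch rather than the authors' original listing: the names (`cdv`, `cluster_sites`, `CC`) are our own, and we assume a simple greedy rule in which a site joins the first cluster whose every current member it matches.

```python
# Illustrative sketch of the CDV-based clustering step (not the paper's
# original listing). Assumes a symmetric cost matrix CC with CC[i][i] == 0
# and a threshold CCR chosen by the network administrator.

def cdv(cc, ccr, i, j):
    """Clustering decision value: 1 iff distinct sites i, j are within CCR."""
    return 1 if i != j and cc[i][j] <= ccr else 0

def cluster_sites(cc, ccr):
    """Greedy grouping (our assumption): a site joins the first cluster
    whose every member it matches (cdv == 1); otherwise it opens a new one."""
    clusters = []
    for site in range(len(cc)):
        for group in clusters:
            if all(cdv(cc, ccr, site, member) == 1 for member in group):
                group.append(site)
                break
        else:
            clusters.append([site])
    return clusters

# Communication costs of Table 1 (sites S1..S6 mapped to indices 0..5).
CC = [
    [0,  2, 10,  7, 10,  7],
    [2,  0,  6,  7,  8, 12],
    [10, 6,  0,  2,  6, 11],
    [7,  7,  2,  0,  9,  7],
    [10, 8,  6,  9,  0,  1],
    [7, 12, 11,  7,  1,  0],
]

print(cluster_sites(CC, 5))  # [[0, 1], [2, 3], [4, 5]]
```

With CCR = 5 this reproduces the three clusters of Table 3: {S1, S2}, {S3, S4}, {S5, S6}.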
In the fragment allocation phase, as we will show in Section 3, we initially allocate fragments to those clusters generated by this algorithm whose applications use the fragments. Allocating fragments to a small number of clusters instead of a large number of sites reduces the number of communications and therefore speeds up the system.
To demonstrate the applicability of clustering, we simulate our clustering algorithm
on the communication cost for all pairs of sites in Table 1 above, when for example the
Table 2  Cluster entry matrix of the cluster decision values (CCR = 5)

      S1   S2   S3   S4   S5   S6
S1     0    1    0    0    0    0
S2     1    0    0    0    0    0
S3     0    0    0    1    0    0
S4     0    0    1    0    0    0
S5     0    0    0    0    0    1
S6     0    0    0    0    1    0
Table 3  The generated clusters and their respective sites

C1    S1   S2   –    –    –    –
C2    –    –    S3   S4   –    –
C3    –    –    –    –    S5   S6
Fig. 2  The generated clusters (Cluster 1: sites 1 and 2; Cluster 2: sites 3 and 4; Cluster 3: sites 5 and 6) and the possible communications between them
communication cost range CCR is equal to 5 units. Table 2 shows the cluster entry matrix of
the cluster decision values for all site pairs.
From the cluster entry matrix, we find that only sites 1 and 2 can be grouped in one cluster, because the cluster decision value for both (S1, S2) and (S2, S1) is equal to 1, while the remaining sites S3, S4, S5, and S6 cannot match the CCR with S1 and S2. Since the cluster decision value for both (S3, S4) and (S4, S3) is equal to 1 and the sites S1, S2, S5, and S6 are far from S3 and S4, sites 3 and 4 can only be grouped together in the second cluster. In the same way, the third cluster is constructed only from sites 5 and 6. Table 3 displays the
generated clusters and their respective sites.
Figure 2 demonstrates the generated clusters and the possible number of communications
between them.
The communication costs within and between clusters have to be taken into account in
the computation of the fragment allocation as will be clarified in Section 3. In our opinion,
computing the average communication cost for each cluster has less time complexity than
computing the least communication cost based on some kinds of sorting. Therefore, we
Table 4  Average communication costs within and between clusters

      C1     C2     C3
C1    2      7.5    9.25
C2    7.5    2      8.25
C3    9.25   8.25   1
consider the cluster average communication cost for the computation of fragment allocation
and we assume symmetric communication cost between clusters.
We compute the average communication cost within and between clusters as defined in the following formula:

avgCC(Ci, Cj) = Σ CC(Sm, Sn) / N(Ci, Cj)

where the sum runs over all communications between distinct sites Sm of cluster Ci and Sn of cluster Cj, and N(Ci, Cj) is the number of such communications.
The average communication cost within cluster 1, for example, is computed as follows:
the sum of communication cost between (S1, S2) which is equal to 4 units (see Table 1)
divided by the number of communications between (S1, S2) which is equal to 2. Thus, the
average communication cost within cluster 1 is equal to 2.
In the same way, we find the communication cost between cluster 1 and cluster 2, for
example, as follows: the sum of communication costs between (S1, S3), (S1, S4), (S2, S3),
(S2, S4), (S3, S1), (S4, S1), (S3, S2) and (S4, S2) which is equal to 60 (10 + 7 + 6 + 7 + 10 +
7 + 6 + 7 from Table 1 above) divided by the number of communications from the sites of
cluster 1 to the sites of cluster 2 which is equal to 8. Therefore, the average communication cost
between cluster 1 and cluster 2 is equal to 7.5. Table 4 shows the computed communication
costs between the DDBs clusters.
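The two worked examples above can be reproduced with a few lines of Python. This is an illustrative sketch (the names are ours); the average is taken over all ordered pairs of distinct sites, which yields the same values as the text's directed-pair counting.

```python
# Sketch of the average communication cost within/between clusters,
# following the worked examples in the text. Names are illustrative.

# Communication costs of Table 1 (sites S1..S6 mapped to indices 0..5).
CC = [
    [0,  2, 10,  7, 10,  7],
    [2,  0,  6,  7,  8, 12],
    [10, 6,  0,  2,  6, 11],
    [7,  7,  2,  0,  9,  7],
    [10, 8,  6,  9,  0,  1],
    [7, 12, 11,  7,  1,  0],
]

def avg_comm_cost(cc, cluster_a, cluster_b):
    """Average cost over all ordered pairs (m, n) of distinct sites,
    m in cluster_a and n in cluster_b."""
    pairs = [(m, n) for m in cluster_a for n in cluster_b if m != n]
    return sum(cc[m][n] for m, n in pairs) / len(pairs)

C1, C2, C3 = [0, 1], [2, 3], [4, 5]
print(avg_comm_cost(CC, C1, C1))  # 2.0  (within cluster 1)
print(avg_comm_cost(CC, C1, C2))  # 7.5  (between clusters 1 and 2)
```

Running this over all cluster pairs reproduces every entry of Table 4.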
3 Fragment allocation
Distributing database fragments among clusters and their sites improves the DDBs perfor-
mance by minimizing the data transferred and accessed during the execution time, reducing
the storage overheads, and increasing availability and reliability where multiple copies of the
same data are allocated.
We introduce a computational method to determine the fragment allocation in the DDBs
based on the query processing cost functions that find out precisely whether the fragment is
allocated to or cancelled from each cluster and its sites. The detailed basics of our method are described in the following points.
- Initial fragment allocation
We start by allocating fragments to all clusters having applications that use these fragments. If a cluster shows a positive (true) allocation decision value, then we allocate the fragments to all sites of that cluster and check whether they also show true allocation decision values; if so, we keep the fragments at those sites. Otherwise, we cancel the fragments from the cluster and do not allocate them to its sites. For example, if fragment 1 is used by the applications in sites 1 and 2, then we allocate this fragment to cluster 1 instead of allocating the fragment to its sites (sites 1 and 2). Next, if cluster 1 yields a true allocation decision value, then we allocate the fragment to sites 1 and 2, and afterwards we maintain the fragment in each site that yields a true allocation decision value. On the other hand, if
cluster 1 yields a false allocation decision value, then we cancel fragment 1 from this cluster and do not allocate it to its sites.
- Allocation decision value
The allocation decision value is computed as a logical value for the comparison between
the cost of not allocating the fragment to the cluster/site (the fragment handled remotely)
and the cost of allocating the fragment to the cluster/site. If the cost of not allocating the
fragment to the cluster/site is greater than or equals to the cost of allocating the fragment
to the cluster/site, then the allocation decision value is positive (true) and the fragment is
allocated to the cluster/site. On the other hand, if the cost of not allocating is less than the cost of allocating, then the allocation decision value is negative (false) and the fragment is cancelled from the cluster/site.
- Cost of allocating the fragment to the cluster/site
The cost of allocating the fragment to the cluster/site is defined by the following costs:
local retrievals, local updates, space, remote update, and remote communication.
- Cost of not allocating the fragment to the cluster/site
The cost of not allocating the fragment to the cluster/site is defined by costs of local and
remote retrievals.
In the next subsections we detail our fragment allocation method.
Initially, we allocate the fragments to all clusters whose applications use these fragments. The allocation
decision value ADV(Tk , Fi , C j ) of allocating a fragment Fi issued by the transaction Tk to a
cluster C j , is computed as the result of the difference between CN(Tk , Fi , C j ) the cost of not
allocating the fragment Fi issued by the transaction Tk to the cluster C j and CA(Tk , Fi , C j )
the cost of allocating the fragment Fi issued by the transaction Tk to the cluster C j .
The cost of allocating the fragment Fi to the cluster C j is computed as the sum of the following
costs: local retrievals, local updates, space, remote update, and remote communication.
- The cost of local retrievals is equal to the average cost of local retrievals at cluster Cj multiplied by the average retrieval frequency issued by the transaction Tk to the fragment Fi at cluster Cj.
- The cost of local updates is equal to the average cost of local updates at cluster Cj multiplied by the average update frequency issued by the transaction Tk to the fragment Fi at cluster Cj.
- The cost of space is equal to the cost of the space occupied by the fragment Fi in the cluster Cj multiplied by the size of the fragment Fi (in bytes).
- The cost of remote updates covers the updates sent from the other clusters Cx: the average cost of local updates at cluster Cj multiplied by the average update frequency issued by the transaction Tk to the fragment Fi, summed over every cluster except the current one.
The cost of not allocating the fragment Fi to the cluster C j is computed as the sum of the
cost of local retrievals and sum of the cost of remote retrievals.
- The cost of local retrievals is equal to the average cost of local retrievals at cluster Cj multiplied by the average retrieval frequency issued by the transaction Tk to the fragment Fi at the cluster Cj, as defined in Section 3.1.1.
- The cost of remote retrievals covers the retrievals from the other clusters Cx: the retrieval ratio (Rratio = Unit Retrieval/Unit Communication) multiplied by the average retrieval frequency issued by the transaction Tk to the fragment Fi at the cluster Cj, summed over every cluster except the current one, multiplied by the average communication cost between the clusters.
According to the previous definitions, the cost of not allocating, CN(Tk, Fi, Cj), is computed as the sum of the local retrieval cost at cluster Cj and the remote retrieval costs from the other clusters Cx.
We define the allocation decision value ADV (Tk , Fi , C j ) that allocates the fragment Fi issued
by the transaction Tk to the cluster C j as a logical value and compute it as follows:
ADV(Tk, Fi, Cj) = 1;  CN(Tk, Fi, Cj) ≥ CA(Tk, Fi, Cj)
ADV(Tk, Fi, Cj) = 0;  CN(Tk, Fi, Cj) < CA(Tk, Fi, Cj)        (9)
If the allocation decision value is true (1), then the fragment is permanently allocated to the
cluster C j . On the other hand, if the allocation decision value is false (0), then the fragment
is cancelled from the cluster C j .
Based on the basics and formulas described above, we define our fragment allocation
algorithm as follows:
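The decision at the heart of the allocation algorithm can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' listing: the names are ours, and the cost terms are assumed to be pre-aggregated scalars per transaction Tk, fragment Fi, and cluster Cj.

```python
# Minimal sketch of the allocation decision (names are ours; cost terms
# are assumed pre-aggregated per transaction/fragment/cluster).

def allocation_cost(local_retrievals, local_updates, space, remote_updates,
                    remote_communication):
    """CA: cost of allocating the fragment to the cluster/site."""
    return (local_retrievals + local_updates + space
            + remote_updates + remote_communication)

def non_allocation_cost(local_retrievals, remote_retrievals):
    """CN: cost of handling the fragment remotely instead."""
    return local_retrievals + remote_retrievals

def adv(cn, ca):
    """Allocation decision value, formula (9): keep the fragment iff CN >= CA."""
    return 1 if cn >= ca else 0

# A fragment is kept when serving it remotely would cost at least as much
# as hosting it locally (values are made up for illustration):
ca = allocation_cost(4, 3, 1, 2, 2)   # 12
cn = non_allocation_cost(4, 9)        # 13
print(adv(cn, ca))  # 1 -> fragment stays allocated
```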
Table 5  Retrieval and update frequencies requested for each fragment from each cluster and site

Fragment  Cluster  Site  Retrieval freq.  Update freq.
F1        C1       S1         80              10
                   S2         60              26
          C2       S3         60              16
                   S4          0               0
          C3       S5         35               5
                   S6         25               5
F2        C2       S3         20               4
                   S4         20               6
          C3       S5          5              30
                   S6        105              20
F3        C1       S1          0              20
                   S2          0              10
          C2       S3         30               0
                   S4          0               0
          C3       S5         40              30
                   S6         30              10
F4        C1       S1         10              20
                   S2         10              20
          C2       S3         65              12
                   S4          5              12
F5        C1       S1         70              20
                   S2          6              10
          C2       S3         20              10
                   S4         20              10
          C3       S5         35              10
                   S6         45              20
F6        C1       S1          0              10
                   S2          0               0
          C3       S5         25               5
                   S6          5               5
F7        C2       S3         25               5
                   S4         35              10
          C3       S5         10               0
                   S6         30               0
F8        C1       S1         10              20
                   S2         80              20
          C2       S3         20               0
                   S4         60              10
          C3       S5          0              20
                   S6         20               0
To illustrate the simulation of our fragment allocation method and demonstrate its applicabil-
ity, performance and usefulness, we propose the following input data: eight fragments to be
distributed over the clusters generated in the previous section, number of retrieval and update
frequencies requested from each cluster and its sites, costs of space, retrieval, and update in
each site for each cluster, and the size of each unit of retrieval, update and communication.
Tables 5–7 show the proposed data.
The results obtained from our allocation algorithm are the allocated and cancelled frag-
ments in all clusters which are described in Table 8.
Figure 3 shows the distribution of the proposed fragments over the clusters.
To determine the fragments that would be allocated to the sites, we apply our fragment allocation algorithm to the sites of each cluster to which those fragments were allocated. Initially, the fragments are allocated to all sites having applications that use those fragments, and the allocation decision values of allocating the fragments to the sites are computed in the same way as illustrated in Section 3.1, taking into account the data of the sites instead of the data of the clusters. The fragments are permanently allocated to or cancelled from the sites according to
Fig. 3  Distribution of the allocated fragments over the clusters (for example, fragments 1, 5, and 8 are allocated to Cluster 1)
their allocation decision values. If the allocation decision value is true, then the fragment is
permanently allocated to the site. On the other hand, if the allocation decision value is false,
then the fragment is cancelled from the site.
4 Performance evaluation
We have studied the clustering and the fragment allocation in DDBs and performed an exten-
sive experimental analysis on our algorithms whose characteristics are reported in Sections
2 and 3. In the following subsections we detail the performance evaluations obtained by our
clustering and fragment allocation algorithms.
Number of communications. We think that grouping sites into clusters reduces the number of communications, which minimizes the communication costs needed in further processes through the fragment allocation phase. In the example introduced in Section 2, where we simulate a network of six sites, each site communicates with the other 5 sites, so the initial total number of communications is 6 * 5 = 30. After clustering the sites into three clusters, each cluster communicates with the other 2 clusters; taking into account the communication within each cluster itself (because it contains sites with different communication costs), the total number of communications is 3 * 2 + 3 = 9. Therefore, a high performance is achieved by using our clustering method, which reduces the number of communications from 30 to 9 and enhances the system progress by 70%, as described in the following formula:

Improvement = (30 − 9) / 30 = 70%
Fig. 4  Communication cost between the cluster pairs (C1, C2), (C1, C3), and (C2, C3)
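The counting argument above can be checked with a short sketch (helper names are ours; we count directed links plus one intra-cluster link per cluster, as in the text):

```python
# Back-of-envelope check of the communication-count reduction: a full mesh
# of n sites needs n*(n-1) directed links, while k clusters need k*(k-1)
# inter-cluster links plus one intra-cluster link per cluster.

def mesh_links(n_sites):
    return n_sites * (n_sites - 1)

def clustered_links(n_clusters):
    return n_clusters * (n_clusters - 1) + n_clusters

before = mesh_links(6)       # 30
after = clustered_links(3)   # 9
improvement = 100 * (before - after) / before
print(improvement)  # 70.0
```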
Table 9  Performance improvement of allocating fragments to clusters

Cluster  Initial # of allocated fragments  Final # of allocated fragments  Improvement
C1                      6                                3                   +50%
C2                      7                                3                   +57.1%
C3                      7                                5                   +28.6%
Cost of communications. We used the average communication cost between clusters and their sites because computing the average communication cost has lower time complexity than methods that sort the sites to find the least communication cost. In this case the initial communication cost between the sites in cluster 1 and the sites in cluster 2, for example, which is equal to 60 (10 + 7 + 6 + 7 + 10 + 7 + 6 + 7 from Table 1 above), is replaced by the average communication cost between cluster 1 and cluster 2 (C1, C2) and between cluster 2 and cluster 1 (C2, C1), which is equal to 15 (7.5 + 7.5 from Table 4 above). Thus, the communication cost is reduced from 60 to 15, and the performance improvement is computed in the following formula:

Improvement = (60 − 15) / 60 = 75%
We can say that a high performance is also achieved and the system is enhanced by 75%.
The communication costs between the other clusters (C1, C3) and (C2, C3) are also
minimized in the same way and thus enhance the system performance by about the same
percent. Figure 4 depicts the performance achieved by our clustering method.
We think that the system performance is enhanced by removing the redundant fragments
from the database clusters and their sites, and by increasing availability and reliability where
multiple copies of the same fragments are allocated. This will reduce the communication
costs where the fragments are needed frequently.
Fig. 5  Number of fragments allocated to each cluster before and after fragment allocation
Site  Initial # of allocations  Final # of allocations  Improvement
S1               3                         2              +33.3%
S2               3                         2              +33.3%
S3               3                         3                0%
S4               3                         2              +33.3%
S5               5                         4              +20%
S6               5                         4              +20%
Statistics related to the performance of our fragment allocation method can be collected by analyzing the information generated from Table 8 in Section 3.1 and summarized in Table 9. Initially, allocating fragments to all clusters having applications requesting these fragments generates a total of 20 allocations, while 11 allocations are generated after applying our fragment allocation method. In this case the system performance improved by an average of 45%, as defined in the following formula:

Improvement = (20 − 11) / 20 = 45%
Figure 5 shows the improvement of the system performance achieved by our fragment
allocation method over the clusters.
We generate 22 allocations after clustering and initially allocating fragments to all sites having applications requesting these fragments, while 17 allocations are generated when we perform our fragment allocation method on the sites. The following formula defines the performance improvement for allocating the fragments to the sites after clustering:

Improvement = (22 − 17) / 22 ≈ 22.7%
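The per-site figures in the table above, and the overall improvement, follow directly from the initial and final allocation counts; a quick sketch (counts transcribed from the site table):

```python
# Reproduces the site-level improvement percentages from the initial and
# final allocation counts given in the site table above.

initial = {"S1": 3, "S2": 3, "S3": 3, "S4": 3, "S5": 5, "S6": 5}
final   = {"S1": 2, "S2": 2, "S3": 3, "S4": 2, "S5": 4, "S6": 4}

per_site = {s: round(100 * (initial[s] - final[s]) / initial[s], 1)
            for s in initial}
overall = round(100 * (sum(initial.values()) - sum(final.values()))
                / sum(initial.values()), 1)

print(per_site)  # S1/S2/S4 -> 33.3, S3 -> 0.0, S5/S6 -> 20.0
print(overall)   # 22.7
```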
Fig. 6  Number of fragments allocated to each site before and after fragment allocation
5 Conclusion
In this paper we presented an integrated method for clustering and fragment allocation in
DDBs. Our clustering method was able to minimize the number of communications and
the communication costs between the sites. An experimental analysis was conducted over
the DDBs sites and showed that grouping sites into clusters yields the best performance. The results
obtained from our fragment allocation method demonstrated high performance, and enhanced
the database applications progress in a network environment by optimizing the fragment
allocation and by increasing availability and reliability where multiple copies of the same
fragments are allocated. This approach can be implemented in different network environments
even if the input parameters are very large.
References
1. Karlapalem K, Navathe S, Morsi M (1994) Issues in distribution design of object oriented databases.
Distributed Object Management, Morgan Kaufmann Publishers
2. Ezeife C, Barker K (1998) Distributed object based design: vertical fragmentation of classes. Int J Distr
Parallel Databases, 6(4):327–360. Kluwer Academic Publishers.
3. Yee W, Donahoo M, Navathe S (2000) A framework for server data fragment grouping to improve server
scalability in intermittently synchronized databases. CIKM.
4. Huang Y, Chen J (2001) Fragment allocation in distributed database design. J Inf Sci Eng 17:491–506
5. Cheng C, Lee W, Wong K (2002) A genetic algorithm-based clustering approach for database partitioning.
IEEE Trans Syst Man Cybern—Part C: Appl Rev 32(3)
6. Lim S, Kai Y (1997) Vertical fragmentation and allocation in distributed deductive database systems. Inf
Syst 22(1):1–24
7. Hwang S, Yang C (1998) Component and data distribution in a distributed workflow management system.
IEEE Soft Eng Conf 244–251
8. Son J, Kim M (2003) An adaptable vertical partitioning method in distributed systems. J Syst Soft. Elsevier
9. Daudpota N (1998) Five steps to construct a model of data allocation for distributed database systems. J Intell Inf Syst: Integr Artif Intell Database Technol 11(2):153–168
10. Lee H, Park Y, Jang G, Huh S (2000) Designing a distributed database on a local area network: a method-
ology and decision support system. Inf Soft Technol 42:171–184
11. Tamhankar A, Ram S (1998) Database fragmentation and allocation: an integrated methodology and case
study. IEEE Trans Syst, Man Cybern—Part A. Syst Hum 28(3):288–305