
J Supercomput (2007) 39:3–18
DOI 10.1007/s11227-006-0001-8

A high-performance computing method for data allocation in distributed database systems

Ismail Omar Hababeh · Muthu Ramachandran · Nicholas Bowring

© Springer Science + Business Media, LLC 2007

Abstract The performance of a distributed database system (DDBs) can be enhanced by speeding up the computation of data allocation, which leads to faster allocation decisions, smaller data redundancy, and shorter processing times. This paper presents an integrated method for grouping distributed sites into clusters and customizing the allocation of database fragments to the clusters and their sites. We design a high-speed clustering and allocation method that determines which fragments should be allocated to which clusters and sites so as to maintain data availability and constant system reliability, evaluate the performance achieved by this method, and demonstrate its efficiency by means of tabular and graphical representations. We tested our method over different network sites and found that it reduces the data transferred between sites during execution, minimizes the communication cost needed for processing applications, and handles the database queries while meeting their future needs.

Keywords High-performance data allocation · Communication cost · Clustering · Fragment allocation · Simulation · Performance evaluation

I. O. Hababeh (✉) · M. Ramachandran
Leeds Metropolitan University, Faculty of Information and Technology, School of Computing,
Leeds, LS6 3QS, UK
e-mail: Ismail.Hababeh@uaeu.ac.ae

M. Ramachandran
e-mail: M.ramachandran@lmu.ac.uk

N. Bowring
Manchester Metropolitan University, Department of Engineering & Technology,
Manchester, M1 5GD, UK
e-mail: N.Bowring@mmu.ac.uk

1 Introduction

Considerable progress has been made in the last few years in improving the performance
of distributed database systems. The development of data allocation models in DDBs
is becoming difficult due to the complexity introduced by the huge number of sites and their
communication considerations. Under such conditions, simulation of clustering and data allocation
is an adequate tool for understanding and evaluating the performance of data allocation in
DDBs. Clustering sites and fragment allocation are key challenges in DDBs performance,
and are considered efficient methods that play a major role in reducing the data transferred
and accessed during the execution of applications. Clustering is a method of grouping
sites according to a certain criterion to increase system I/O performance. Fragment
allocation describes the way in which the database fragments are distributed
among the clusters and their respective sites in the DDBs; it attempts to minimize communication
costs by distributing the global database over the sites, increases availability and
reliability where multiple copies of the same data are allocated, and reduces storage
overheads.
Significant efforts have been made to minimize the amount of data transferred during
application processing [1] and to reduce the amount of irrelevant data accessed by
applications [2]. Several kinds of distributed approaches have been implemented that describe
clustering and data allocation in distributed database systems. Some approaches have
an exponential time complexity in their allocation methods [3–5]. Some methods do not
replicate fragments over the DDBs sites, which decreases availability and integrity
[6–8]. Other approaches apply only to specific types of network connectivity or to a small
number of sites [9–11].
This paper presents a high-performance approach that computes the decision values for
fragment allocation in DDBs, and shows a logical way of grouping sites into clusters to
which fragments are allocated. We describe a way of grouping the sites of a DDBs
into clusters and allocating fragments to the clusters and their respective sites. This approach
improves the performance of application execution. It does so by minimizing the access
cost of fragments, reducing the amount of data transferred at run time, speeding
up the database system by maximizing the degree of parallel execution, and increasing data
availability and integrity by allocating multiple copies (replicas) of the same fragment
over the sites where possible.
The remainder of this paper is organized as follows. Section 2 describes our clustering
technique. Section 3 describes how to allocate fragments to the clusters and then to their sites.
Section 4 presents the performance evaluation. Finally, Section 5 makes some concluding remarks.

2 Clustering sites

Grouping sites into clusters improves system performance by reducing the number of
communications during the processes of allocating fragments to the DDBs clusters and their
sites. In this section, we introduce a simulation of a complex set of network communications
by generating clusters based on the communication cost between the sites. We describe here
a clustering method that assigns all sites in the DDBs network to a small number of clusters
(groups) in order to reduce the number of communications, and thus to improve system
performance. Figure 1 shows the possible number of communications between 6 distributed
sites over the DDBs network.

[Figure: six sites (Site 1–Site 6) fully interconnected over the DDBs network]
Fig. 1 Number of communications between 6 distributed sites

This method categorises the sites based on a clustering decision value that determines
whether or not a site can be grouped in a specific cluster. Basically, the clustering decision
value is specified through two elements: the communication cost between the current site
and the other sites in the DDBs, and the communication cost range that the site should match
to be grouped in a specific cluster. The following subsections detail the basics of our method.

2.1 Communication cost between the current site and the other sites in the DDBs
network

We define the cost of communication between each pair of sites Si and Sj in the DDBs as
CC(Si, Sj). For simplicity, we assume the following communication cost rules:
• The communication cost between any two sites is symmetric: CC(Si, Sj) is equal to CC(Sj, Si);
• The communication cost within the same site is equal to zero;
• The communication cost is proportional to the distance between the sites.

Table 1 presents an example of the communication cost between the six sites in the proposed
DDBs measured in communication units.

Table 1 Communication costs between DDBs sites

SITE #  S1  S2  S3  S4  S5  S6
S1       0   2  10   7  10   7
S2       2   0   6   7   8  12
S3      10   6   0   2   6  11
S4       7   7   2   0   9   7
S5      10   8   6   9   0   1
S6       7  12  11   7   1   0


The first row of data represents the communication cost between site 1 and the other sites in
the network: the communication cost within site 1 is considered to be zero, the communication
cost between site 1 and site 2 is equal to 2 units, and the communication costs between site
1 and sites 3, 4, 5, and 6 are equal to 10, 7, 10, and 7 units, respectively. The remaining data
rows represent the communication costs between the other sites in the same manner.

2.2 Communication cost range (CCR)

We define "CCR" as the maximum number of communication units of communication cost
allowed between any two sites for them to be grouped in the same cluster.
This number is determined by the DDBs network administrators.

2.3 Clustering decision value (CDV)

We define the clustering decision value "CDV" as a logical value that determines whether each
pair of sites Si and Sj in the DDBs network can be grouped in one cluster or not. CDV is computed
by comparing the communication cost CC(Si, Sj) between the sites Si and Sj with the
communication cost range CCR. We describe CDV in the following formula:

cdv(Si, Sj) = { 1;  CC(Si, Sj) ≤ CCR ∧ i ≠ j
              { 0;  CC(Si, Sj) > CCR ∨ i = j

Accordingly, if cdv(Si , S j ) is equal to 1, then the sites Si and S j are assigned to one cluster,
otherwise they are assigned to different clusters.
In order to set up an efficient clustering method, we define the following clustering algo-
rithm based on the basics described above.

Input: Matrix of communication cost between sites CC(Si, Sj)
       CCR value
       Nmax: number of sites in the DDBs network
Output: Set of clusters and their sites (cluster entry matrix)
Step 1: For I = 1, Nmax, do steps 2–8
Step 2: For J = 1, Nmax, do steps 3–7
Step 3: If I ≠ J AND CC(Si, Sj) <= CCR, go to step 4
        Else, go to step 5
Step 4: Set 1 to the cluster entry matrix, go to step 6
Step 5: Set 0 to the cluster entry matrix
Step 6: End If
Step 7: End For
Step 8: End For
Step 9: Stop.

In the fragment allocation phase, as we will show in Section 3, we initially allocate frag-
ments to the clusters generated by this algorithm that have applications using those fragments.
Allocating fragments to a small number of clusters instead of a large number of sites
reduces the number of communications and therefore speeds up the system performance.
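The algorithm above fills the cluster entry matrix but leaves the grouping of matching pairs into clusters implicit. A minimal Python sketch of both steps (the function name and the union-find grouping are our additions, not part of the paper):

```python
from itertools import combinations

def cluster_sites(cc, ccr):
    """Group sites into clusters: two sites may share a cluster when their
    communication cost is within CCR; matching pairs are merged transitively.
    `cc` is a symmetric cost matrix, `ccr` the communication cost range."""
    n = len(cc)
    # Cluster entry matrix: entry[i][j] = 1 iff sites i and j can be grouped.
    entry = [[1 if i != j and cc[i][j] <= ccr else 0 for j in range(n)]
             for i in range(n)]
    # Merge matching pairs into clusters (union-find over the entry matrix).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in combinations(range(n), 2):
        if entry[i][j]:
            parent[find(i)] = find(j)
    clusters = {}
    for s in range(n):
        clusters.setdefault(find(s), []).append(s + 1)  # 1-based site numbers
    return entry, sorted(clusters.values())

# Communication costs from Table 1, with CCR = 5 as in the worked example.
CC = [[0, 2, 10, 7, 10, 7],
      [2, 0, 6, 7, 8, 12],
      [10, 6, 0, 2, 6, 11],
      [7, 7, 2, 0, 9, 7],
      [10, 8, 6, 9, 0, 1],
      [7, 12, 11, 7, 1, 0]]
entry, clusters = cluster_sites(CC, 5)
print(clusters)  # → [[1, 2], [3, 4], [5, 6]]
```

Run on the Table 1 costs with CCR = 5, this reproduces the three clusters of Table 3.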
To demonstrate the applicability of clustering, we simulate our clustering algorithm
on the communication costs for all pairs of sites in Table 1 above, when, for example, the
communication cost range CCR is equal to 5 units. Table 2 shows the cluster entry matrix of
the cluster decision values for all site pairs.

Table 2 The cluster entry matrix

Site #  S1  S2  S3  S4  S5  S6
S1       0   1   0   0   0   0
S2       1   0   0   0   0   0
S3       0   0   0   1   0   0
S4       0   0   1   0   0   0
S5       0   0   0   0   0   1
S6       0   0   0   0   1   0

Table 3 The generated clusters and their sites

Cluster #  Cluster sites
C1         S1, S2
C2         S3, S4
C3         S5, S6

[Figure: the three generated clusters (C1: sites 1–2, C2: sites 3–4, C3: sites 5–6) connected over the network]
Fig. 2 The DDBs clusters
From the cluster entry matrix, we find that only site 1 and site 2 can be grouped
in one cluster, because the cluster decision value for both (S1, S2) and (S2, S1) is equal to
1, while the remaining sites S3, S4, S5, and S6 cannot match the CCR with S1 and S2. Since the
cluster decision value for both (S3, S4) and (S4, S3) is equal to 1 and the sites S1, S2, S5,
and S6 are far from S3 and S4, sites 3 and 4 can only be grouped in the second cluster.
In the same way, the third cluster is constructed from sites 5 and 6 only. Table 3 displays the
generated clusters and their respective sites.
Figure 2 demonstrates the generated clusters and the possible number of communications
between them.
The communication costs within and between clusters have to be taken into account in
the computation of the fragment allocation, as will be clarified in Section 3. In our opinion,
computing the average communication cost for each cluster has lower time complexity than
finding the least communication cost, which requires some form of sorting. Therefore, we
consider the cluster average communication cost in the computation of fragment allocation,
and we assume symmetric communication costs between clusters.

Table 4 The communication costs between clusters

Cluster #  C1    C2    C3
C1         2     7.5   9.25
C2         7.5   2     8.25
C3         9.25  8.25  1
We compute the average communication cost within and between clusters as defined in
the following formula:

Average communication cost = (Sum of the communication costs between the clusters' sites) / (Number of communications between the clusters' sites)

The average communication cost within cluster 1, for example, is computed as follows:
the sum of the communication costs between (S1, S2) and (S2, S1), which is equal to 4 units
(see Table 1), divided by the number of communications between them, which is equal to 2. Thus, the
average communication cost within cluster 1 is equal to 2.
In the same way, we find the communication cost between cluster 1 and cluster 2, for
example, as follows: the sum of communication costs between (S1, S3), (S1, S4), (S2, S3),
(S2, S4), (S3, S1), (S4, S1), (S3, S2) and (S4, S2) which is equal to 60 (10 + 7 + 6 + 7 + 10 +
7 + 6 + 7 from Table 1 above) divided by the number of communications from the sites of
cluster 1 to the sites of cluster 2 which is equal to 8. Therefore, the average communication cost
between cluster 1 and cluster 2 is equal to 7.5. Table 4 shows the computed communication
costs between the DDBs clusters.
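The formula and worked example above can be sketched as a short Python function (the name is ours). By symmetry, averaging over every ordered pair gives the same result as the paper's two-direction count:

```python
def avg_comm_cost(cc, cluster_a, cluster_b):
    """Average communication cost between (or within) clusters: the sum of
    pairwise costs between the clusters' sites divided by the number of
    such communications, per Section 2. Site numbers are 1-based."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b if i != j]
    return sum(cc[i - 1][j - 1] for i, j in pairs) / len(pairs)

# Communication costs from Table 1.
CC = [[0, 2, 10, 7, 10, 7],
      [2, 0, 6, 7, 8, 12],
      [10, 6, 0, 2, 6, 11],
      [7, 7, 2, 0, 9, 7],
      [10, 8, 6, 9, 0, 1],
      [7, 12, 11, 7, 1, 0]]
print(avg_comm_cost(CC, [1, 2], [1, 2]))  # within C1 → 2.0
print(avg_comm_cost(CC, [1, 2], [3, 4]))  # C1-C2 → 7.5
print(avg_comm_cost(CC, [1, 2], [5, 6]))  # C1-C3 → 9.25
```

The three printed values match the first row and column of Table 4.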

3 Fragment allocation

Distributing database fragments among clusters and their sites improves the DDBs perfor-
mance by minimizing the data transferred and accessed during the execution time, reducing
the storage overheads, and increasing availability and reliability where multiple copies of the
same data are allocated.
We introduce a computational method that determines the fragment allocation in the DDBs
based on query processing cost functions, which find out precisely whether a fragment should be
allocated to or cancelled from each cluster and its sites. The basic details of our method are
described in the following points.
• Initial fragment allocation
We start by allocating fragments to all clusters that have applications using these fragments.
If a cluster shows a positive (true) allocation decision value, then we allocate the fragments
to all its sites, and each site that also shows a true allocation decision value
keeps the fragments. Otherwise, we cancel the fragments from the
cluster and do not allocate them to its sites. For example, if fragment 1 is used by
the applications in sites 1 and 2, then we allocate this fragment to cluster 1 instead of
allocating it directly to its sites (sites 1 and 2). Next, if cluster 1 proves a true allocation
decision value, then we allocate the fragment to sites 1 and 2, and after that we maintain
the fragment in each site that proves a true allocation decision value. On the other hand, if
cluster 1 proves a false allocation decision value, then we cancel fragment 1 from this cluster
and we do not allocate it to its sites.
• Allocation decision value
The allocation decision value is computed as a logical value by comparing
the cost of not allocating the fragment to the cluster/site (the fragment handled remotely)
with the cost of allocating the fragment to the cluster/site. If the cost of not allocating the
fragment is greater than or equal to the cost of allocating it,
then the allocation decision value is positive (true) and the fragment is
allocated to the cluster/site. On the other hand, if the cost of not allocating is less than the
cost of allocating, then the allocation decision value is negative (false) and the fragment is cancelled
from the cluster/site.
• Cost of allocating the fragment to the cluster/site
The cost of allocating the fragment to the cluster/site is defined by the following costs:
local retrievals, local updates, space, remote updates, and remote communications.
• Cost of not allocating the fragment to the cluster/site
The cost of not allocating the fragment to the cluster/site is defined by the costs of local and
remote retrievals.
In the next subsections we detail our fragment allocation method.

3.1 Allocating fragments to clusters

Initially, we allocate the fragments to all clusters using these fragments. The allocation
decision value ADV(Tk, Fi, Cj) of allocating a fragment Fi issued by the transaction Tk to a
cluster Cj is computed from the difference between CN(Tk, Fi, Cj), the cost of not
allocating the fragment Fi issued by the transaction Tk to the cluster Cj, and CA(Tk, Fi, Cj),
the cost of allocating the fragment Fi issued by the transaction Tk to the cluster Cj.

3.1.1 Cost of allocating a fragment to a cluster: CA(Tk , Fi , C j )

The cost of allocating the fragment Fi to the cluster Cj is computed as the sum of the following
costs: local retrievals, local updates, space, remote updates, and remote communications.
• The cost of local retrievals is equal to the average cost of local retrievals at cluster Cj multiplied
by the average frequency of retrievals issued by the transaction Tk to the fragment
Fi at cluster Cj.

CLRsum(Tk, Fi, Cj) = CLR(Tk, Fi, Cj) ∗ FREQLR(Tk, Fi, Cj)    (1)

• The cost of local updates is equal to the average cost of local updates at cluster Cj multiplied by
the average frequency of updates issued by the transaction Tk to the fragment Fi
at the cluster Cj.

CLUsum(Tk, Fi, Cj) = CLU(Tk, Fi, Cj) ∗ FREQLU(Tk, Fi, Cj)    (2)

• The cost of space is equal to the cost of the space occupied by the fragment Fi in the cluster Cj
multiplied by the size of the fragment Fi (in bytes).

CSPsum(Tk, Fi, Cj) = Csp(Tk, Fi, Cj) ∗ Fsize(Tk, Fi)    (3)



• The cost of remote updates accounts for updates sent from the other clusters Cx: the average cost of
local updates at cluster Cj multiplied by the average frequency of updates issued
by the transaction Tk to the fragment Fi, summed over each cluster except the current one.

CRUsum(Tk, Fi, Cj) = CLU(Tk, Fi, Cj) ∗ FREQRU(Tk, Fi, Cj)    (4)

• The cost of remote communications accounts for communications from the other clusters Cx: the
update ratio (Uratio = Unit Update/Unit Communication) multiplied by the average
frequency of updates issued by the transaction Tk to the fragment Fi at the cluster Cj,
multiplied by the average cost of communication between clusters, summed over each cluster
except the current one.

CRCsum(Tk, Fi, Cj) = Uratio ∗ FREQLU(Tk, Fi, Cj) ∗ CRC(Tk, Fi, Cj)    (5)

According to the previous formulas, the cost of allocation CA(Tk, Fi, Cj) is defined as

CA(Tk, Fi, Cj) = CLRsum(Tk, Fi, Cj) + CLUsum(Tk, Fi, Cj) + CSPsum(Tk, Fi, Cj) + CRUsum(Tk, Fi, Cj) + CRCsum(Tk, Fi, Cj)    (6)

3.1.2 Cost of not allocating a fragment to a cluster: CN (Tk , Fi , C j )

The cost of not allocating the fragment Fi to the cluster Cj is computed as the sum of the
cost of local retrievals and the cost of remote retrievals.

• The cost of local retrievals is equal to the average cost of local retrievals at cluster Cj multiplied
by the average frequency of retrievals issued by the transaction Tk to the fragment
Fi at the cluster Cj. It is defined in Section 3.1.1 above.
• The cost of remote retrievals accounts for retrievals from the other clusters Cx: the retrieval
ratio (Rratio = Unit Retrieval/Unit Communication) multiplied by the average
frequency of retrievals issued by the transaction Tk to the fragment Fi at the cluster Cj, for
each cluster except the current one, multiplied by the average cost of communication between
clusters.

CRRsum(Tk, Fi, Cj) = Rratio ∗ FREQRR(Tk, Fi, Cj) ∗ CCC    (7)

According to the previous formulas, the cost of not allocating CN(Tk, Fi, Cj) is defined as

CN(Tk, Fi, Cj) = CLRsum(Tk, Fi, Cj) + CRRsum(Tk, Fi, Cj)    (8)


3.1.3 Allocation decision value for a cluster: ADV(Tk , Fi , C j )

We define the allocation decision value ADV(Tk, Fi, Cj) that allocates the fragment Fi issued
by the transaction Tk to the cluster Cj as a logical value and compute it as follows:

ADV(Tk, Fi, Cj) = { 1;  CN(Tk, Fi, Cj) ≥ CA(Tk, Fi, Cj)
                  { 0;  CN(Tk, Fi, Cj) < CA(Tk, Fi, Cj)    (9)

If the allocation decision value is true (1), then the fragment is permanently allocated to the
cluster C j . On the other hand, if the allocation decision value is false (0), then the fragment
is cancelled from the cluster C j .
Based on the basics and formulas described above, we define our fragment allocation
algorithm as follows:

Input: Tmax: number of transactions issued in the database
       Fmax: number of fragments used for allocation in the database
       Cmax: number of clusters used for allocation in the database
Output: The fragments that are allocated to the clusters
Step 1: For K = 1, Tmax, do steps 2–21
Step 2: For I = 1, Fmax, do steps 3–20
Step 3: For J = 1, Cmax, do steps 4–19
Step 4: Set 0 to CRUsum(Tk , Fi , C j ), CRCsum(Tk , Fi , C j ), CRRsum(Tk , Fi , C j )
Step 5: For X = 1, Cmax, do steps 6–11
Step 6: If X ≠ J, do steps 7–9
        Else, go to step 10
Step 7: CRUsum(Tk , Fi , C j )= CRUsum (Tk , Fi , C j )+ CLU(Tk , Fi , C x )* FREQRU
(Tk , Fi , C x )
Step 8: CRCsum(Tk , Fi , C j )= CRCsum(Tk , Fi , C j )+ Uratio * FREQLU(Tk , Fi , C x )
* CRC(Tk , Fi , C x )
Step 9: CRRsum(Tk , Fi , C j )= CRRsum(Tk , Fi , C j )+ Rratio * FREQRR(Tk , Fi , C x )
* CCC
Step 10: End If
Step 11: End For
Step 12: CA(Tk , Fi , C j ) = CLRsum(Tk , Fi , C j ) + CLUsum(Tk , Fi , C j ) + CSPsum
(Tk , Fi , C j ) + CRUsum(Tk , Fi , C j ) + CRCsum(Tk , Fi , C j )
Step 13: CN(Tk , Fi , C j ) = CLRsum(Tk , Fi , C j ) + CRRsum(Tk , Fi , C j )
Step 14: ADV(Tk , Fi , C j ) = (CN(Tk , Fi , C j ) >= CA(Tk , Fi , C j ))
Step 15: If ADV(Tk , Fi , C j ) = True, go to step 16
Else, go to step 17
Step 16: Allocate the fragment to the current cluster, go to step 18
Step 17: Cancel the fragment from the current cluster
Step 18: End If
Step 19: End For
Step 20: End For
Step 21: End For
Step 22: Stop.
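Steps 4–14 of the algorithm above can be sketched as one Python function for a single (transaction, fragment, cluster) triple. The per-cluster costs and frequencies below are illustrative placeholders, not values from Tables 5–7; only the unit ratios follow Table 7:

```python
def allocation_decision(local, remotes, uratio, rratio, ccc):
    """Compute CA (Eq. 6), CN (Eq. 8) and ADV (Eq. 9) for one
    (transaction, fragment, cluster) triple. `local` holds the current
    cluster's average costs and frequencies; `remotes` holds one record
    per other cluster, following steps 4-14 of the algorithm above."""
    ca = (local["clr"] * local["freq_lr"]      # Eq. (1): local retrievals
          + local["clu"] * local["freq_lu"]    # Eq. (2): local updates
          + local["csp"] * local["fsize"])     # Eq. (3): space
    cn = local["clr"] * local["freq_lr"]       # local retrieval part of Eq. (8)
    for r in remotes:                          # steps 5-11: the other clusters
        ca += r["clu"] * r["freq_ru"]              # step 7, Eq. (4)
        ca += uratio * r["freq_lu"] * r["crc"]     # step 8, Eq. (5)
        cn += rratio * r["freq_rr"] * ccc          # step 9, Eq. (7)
    adv = 1 if cn >= ca else 0                 # step 14, Eq. (9)
    return ca, cn, adv

# Illustrative inputs (NOT taken from Tables 5-7). Uratio = 3/5 and
# Rratio = 2/5 follow the unit sizes in Table 7; CCC is assumed.
local = {"clr": 0.2, "freq_lr": 70, "clu": 0.3, "freq_lu": 15,
         "csp": 0.005, "fsize": 1000}
remotes = [{"clu": 0.3, "freq_ru": 10, "freq_lu": 8, "crc": 7.5, "freq_rr": 20},
           {"clu": 0.25, "freq_ru": 5, "freq_lu": 4, "crc": 9.25, "freq_rr": 12}]
ca, cn, adv = allocation_decision(local, remotes, 3 / 5, 2 / 5, 8.0)
print(adv)  # 1 here: the remote retrieval cost outweighs the allocation cost
```

With these inputs CN exceeds CA, so the fragment would be allocated to (replicated at) the cluster.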

Table 5 The fragments frequencies of retrieval and update

Fragment #  Cluster #  Site #  Retrieval frequency  Update frequency
F1          C1         S1      80                   10
                       S2      60                   26
            C2         S3      60                   16
                       S4      0                    0
            C3         S5      35                   5
                       S6      25                   5
F2          C2         S3      20                   4
                       S4      20                   6
            C3         S5      5                    30
                       S6      105                  20
F3          C1         S1      0                    20
                       S2      0                    10
            C2         S3      30                   0
                       S4      0                    0
            C3         S5      40                   30
                       S6      30                   10
F4          C1         S1      10                   20
                       S2      10                   20
            C2         S3      65                   12
                       S4      5                    12
F5          C1         S1      70                   20
                       S2      6                    10
            C2         S3      20                   10
                       S4      20                   10
            C3         S5      35                   10
                       S6      45                   20
F6          C1         S1      0                    10
                       S2      0                    0
            C3         S5      25                   5
                       S6      5                    5
F7          C2         S3      25                   5
                       S4      35                   10
            C3         S5      10                   0
                       S6      30                   0
F8          C1         S1      10                   20
                       S2      80                   20
            C2         S3      20                   0
                       S4      60                   10
            C3         S5      0                    20
                       S6      20                   0

To illustrate the simulation of our fragment allocation method and demonstrate its applicability,
performance, and usefulness, we propose the following input data: eight fragments to be
distributed over the clusters generated in the previous section, the retrieval and update
frequencies requested at each cluster and its sites, the costs of space, retrieval, and update at
each site of each cluster, and the size of each unit of retrieval, update, and communication.
Tables 5–7 show the proposed data.
The results obtained from our allocation algorithm are the allocated and cancelled frag-
ments in all clusters which are described in Table 8.

Table 6 Costs of space, retrieval, and update

Cluster #  Site #  Cost of space  Cost of retrieval  Cost of update
C1         S1      0.004          0.15               0.25
           S2      0.006          0.25               0.35
C2         S3      0.005          0.15               0.25
           S4      0.007          0.17               0.27
C3         S5      0.003          0.13               0.23
           S6      0.005          0.15               0.25

Table 7 Size of each unit of retrieval, update, and communication

Unit type      Size in bytes
Retrieval      2
Update         3
Communication  5

Table 8 The allocated and cancelled fragments in all clusters

Fragment #  Cluster #  Cost of allocation  Cost of not allocation  Allocation decision value  Allocation status
F1          C1         59.45               177.24                  1                          Allocated
            C2         74.83               74.76                   0                          Cancelled
            C3         85.5                74.16                   0                          Cancelled
F2          C2         74.26               49.84                   0                          Cancelled
            C3         30.01               135.96                  1                          Allocated
F3          C1         60.32               0                       0                          Cancelled
            C2         103.23              37.38                   0                          Cancelled
            C3         54.72               86.52                   1                          Allocated
F4          C1         47.13               25.32                   0                          Cancelled
            C2         68.73               87.22                   1                          Allocated
F5          C1         86.56               96.21                   1                          Allocated
            C2         92.66               49.84                   0                          Cancelled
            C3         86.80               98.88                   1                          Allocated
F6          C1         15.46               0                       0                          Cancelled
            C3         18.31               37.08                   1                          Allocated
F7          C2         7.41                74.76                   1                          Allocated
            C3         34.71               37.08                   1                          Allocated
F8          C1         59.22               113.94                  1                          Allocated
            C2         95.63               99.68                   1                          Allocated
            C3         80.12               24.72                   0                          Cancelled

Figure 3 shows the distribution of the proposed fragments over the clusters.

3.2 Allocating fragments to the sites

To determine the fragments that should be allocated to the sites, we apply our fragment
allocation algorithm to the sites of each cluster to which those fragments were allocated. Initially,
the fragments are allocated to all sites having applications using those fragments, and thus the
allocation decision values for allocating the fragments to the sites are computed in the same
way as illustrated in Section 3.1, taking into account the data of the sites instead of the data of
the clusters. The fragments are permanently allocated to or cancelled from the sites according to

[Figure: Cluster 1 (sites 1–2) holds fragments 1, 5, 8; Cluster 2 (sites 3–4) holds fragments 4, 7, 8; Cluster 3 (sites 5–6) holds fragments 2, 3, 5, 6, 7]
Fig. 3 Fragments allocation to clusters

their allocation decision values. If the allocation decision value is true, then the fragment is
permanently allocated to the site. On the other hand, if the allocation decision value is false,
then the fragment is cancelled from the site.

4 Performance evaluation

We have studied the clustering and the fragment allocation in DDBs and performed an exten-
sive experimental analysis on our algorithms whose characteristics are reported in Sections
2 and 3. In the following subsections we detail the performance evaluations obtained by our
clustering and fragment allocation algorithms.

4.1 Performance evaluation of clustering

Number of communications. We think that grouping sites into clusters reduces the number
of communications, which in turn minimizes the communication costs needed in further
processes during the fragment allocation phase. In the example introduced in
Section 2, where we simulate a network of six sites, each site communicates with the other
5 sites, so the initial total number of communications is 6 * 5 = 30. After clustering the
sites into three clusters, each cluster communicates with the other 2 clusters; taking into
account the communication within each cluster itself, because it contains sites with different
communication costs, the total number of communications is 9. Therefore, high
performance is achieved by using our clustering method, which reduces the number of
communications from 30 to 9 and improves the system performance by 70%, as described in the
following formula:

Improvement percentage = Reduced number of communications / Total number of communications = 21/30 = 0.7
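The 30-to-9 reduction can be checked mechanically. A minimal sketch (the function name is ours) that counts directed inter-site communications before clustering and, after clustering, inter-cluster channels plus one within-cluster channel per multi-site cluster, as the paper does:

```python
def comm_count(n_sites, cluster_sizes):
    """Directed communications before and after clustering, following the
    counting convention of Section 4.1: n*(n-1) before; k*(k-1) between
    clusters plus one within-cluster channel per multi-site cluster after."""
    before = n_sites * (n_sites - 1)
    k = len(cluster_sizes)
    after = k * (k - 1) + sum(1 for s in cluster_sizes if s > 1)
    return before, after

before, after = comm_count(6, [2, 2, 2])
print(before, after)              # → 30 9
print((before - after) / before)  # → 0.7
```

This reproduces the 70% improvement claimed above for the six-site, three-cluster example.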

[Figure: bar chart of initial vs. final communication cost for the cluster pairs (C1, C2), (C1, C3), and (C2, C3)]
Fig. 4 Clustering performance evaluation

Table 9 Performance improvement of allocating fragments to clusters

Cluster #  Initial # of allocated fragments  Final # of allocated fragments  Improvement
C1         6                                 3                               +50%
C2         7                                 3                               +57.1%
C3         7                                 5                               +28.6%

Cost of communications. We used the average communication cost between clusters and
their sites because computing the average has lower time complexity than methods that
depend on sorting sites to find the least communication cost. In this case the initial
communication cost between the sites in cluster 1 and the sites in cluster 2, for example,
which is equal to 60 (10 + 7 + 6 + 7 + 10 + 7 + 6 + 7 from Table 1 above), is replaced by
the average communication cost between cluster 1 and cluster 2 (C1, C2) and between
cluster 2 and cluster 1 (C2, C1), which is equal to 15 (7.5 + 7.5 from Table 4 above). Thus,
the communication cost is reduced from 60 to 15 and the performance improvement is
computed by the following formula:

Improvement percentage = Reduced cost of communications / Total cost of communications = 45/60 = 0.75

We can say that high performance is also achieved here and the system is enhanced by 75%.
The communication costs between the other clusters, (C1, C3) and (C2, C3), are
minimized in the same way and thus enhance the system performance by about the same
percentage. Figure 4 depicts the performance achieved by our clustering method.

4.2 Performance evaluation of fragment allocation

We think that the system performance is enhanced by removing the redundant fragments
from the database clusters and their sites, and by increasing availability and reliability where
multiple copies of the same fragments are allocated. This will reduce the communication
costs where the fragments are needed frequently.

[Figure: bar chart of initial vs. final number of fragments for clusters C1, C2, and C3]
Fig. 5 Fragments allocation to clusters


Table 10 Performance improvement of allocating fragments to sites

Site #  Initial # of allocated fragments  Final # of allocated fragments  Improvement
S1      3                                 2                               +33.3%
S2      3                                 2                               +33.3%
S3      3                                 3                               0%
S4      3                                 2                               +33.3%
S5      5                                 4                               +20%
S6      5                                 4                               +20%

4.2.1 Allocating fragments to clusters

Statistics related to the performance of our fragment allocation method can be collected by
analyzing the information generated from Table 8 in Section 3.1 and summarized in Table 9.
Initially, allocating fragments to all clusters having applications requesting these fragments
generates a total of 20 allocations, while 11 allocations remain after applying our
fragment allocation. In this case, the system performance is improved by an average of 45%, as
defined in the following formula:

Improvement percentage = Reduced number of allocations / Initial number of allocations = 9/20 = 0.45
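The 20-to-11 reduction and the per-cluster figures of Table 9 can be reproduced from the allocation statuses in Table 8; the dictionaries below simply transcribe that table's Allocated/Cancelled column:

```python
# Candidate (initial) allocations and the subset finally allocated, per Table 8.
initial = {"F1": ["C1", "C2", "C3"], "F2": ["C2", "C3"],
           "F3": ["C1", "C2", "C3"], "F4": ["C1", "C2"],
           "F5": ["C1", "C2", "C3"], "F6": ["C1", "C3"],
           "F7": ["C2", "C3"], "F8": ["C1", "C2", "C3"]}
final = {"F1": ["C1"], "F2": ["C3"], "F3": ["C3"], "F4": ["C2"],
         "F5": ["C1", "C3"], "F6": ["C3"], "F7": ["C2", "C3"],
         "F8": ["C1", "C2"]}

n_init = sum(len(v) for v in initial.values())
n_final = sum(len(v) for v in final.values())
print(n_init, n_final, (n_init - n_final) / n_init)  # → 20 11 0.45

# Per-cluster improvement, as summarized in Table 9.
for c in ("C1", "C2", "C3"):
    i = sum(1 for v in initial.values() if c in v)
    f = sum(1 for v in final.values() if c in v)
    print(c, i, f, f"{(i - f) / i:+.1%}")  # +50.0%, +57.1%, +28.6%
```

Both the overall 45% improvement and the per-cluster percentages of Table 9 fall out of the transcribed statuses.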

Figure 5 shows the improvement of the system performance achieved by our fragment
allocation method over the clusters.

4.2.2 Allocating fragments to sites

We generate 22 allocations after clustering and allocating fragments initially to all sites
having applications requesting these fragments, while 17 allocations are generated when we
perform our fragment allocation method into the sites. The following formula defines the
performance improvement for allocating the fragments to the sites after clustering.

Improvement percentage = Reduced number of allocations / Initial number of allocations = 5/22 = 0.227

[Figure: bar chart of initial vs. final number of fragments for sites S1–S6]
Fig. 6 Fragments allocation to the sites

The system performance in this case can be improved by an average of 22.7%.
Table 10 shows the performance improvement of fragment allocation over the sites.
Figure 6 shows the improvement of the system performance achieved by our fragment
allocation method over the sites after clustering.

5 Conclusion

In this paper we presented an integrated method for clustering and fragment allocation in
DDBs. Our clustering method minimizes the number of communications and
the communication costs between the sites. An experimental analysis conducted over
the DDBs sites showed the performance benefit of grouping sites into clusters. The results
obtained from our fragment allocation method demonstrated high performance, and enhanced
the processing of database applications in a network environment by optimizing the fragment
allocation and by increasing availability and reliability where multiple copies of the same
fragments are allocated. This approach can be implemented in different network environments
even if the input parameters are very large.

References

1. Karlapalem K, Navathe S, Morsi M (1994) Issues in distribution design of object oriented databases.
Distributed Object Management, Morgan Kaufmann Publishers
2. Ezeife C, Barker K (1998) Distributed object based design: vertical fragmentation of classes. Int J Distr
Parallel Databases, 6(4):327–360. Kluwer Academic Publishers.
3. Yee W, Donahoo M, Navathe S (2000) A framework for server data fragment grouping to improve server
scalability in intermittently synchronized databases. CIKM.
4. Huang Y, Chen J (2001) Fragment allocation in distributed database design. J Inf Sci Eng 17:491–506
5. Cheng C, Lee W, Wong K (2002) A genetic algorithm-based clustering approach for database partitioning.
IEEE Trans Syst Man Cybern—Part C: Appl Rev 32(3)

6. Lim S, Kai Y (1997) Vertical fragmentation and allocation in distributed deductive database systems. Inf
Syst 22(1):1–24
7. Hwang S, Yang C (1998) Component and data distribution in a distributed workflow management system.
IEEE Soft Eng Conf 244–251
8. Son J, Kim M (2003) An adaptable vertical partitioning method in distributed systems. J Syst Soft. Elsevier
9. Daudpota N (1998) Five steps to construct a model of data allocation for distributed database systems. J
Intell Inf Syst: Integr Artif Intell Database Technol 11(2):153–168
10. Lee H, Park Y, Jang G, Huh S (2000) Designing a distributed database on a local area network: a method-
ology and decision support system. Inf Soft Technol 42:171–184
11. Tamhankar A, Ram S (1998) Database fragmentation and allocation: an integrated methodology and case
study. IEEE Trans Syst, Man Cybern—Part A. Syst Hum 28(3):288–305

