An Effective Algorithm For Parallelizing
CH2895-1/90/0000/0103$01.00 © 1990 IEEE
even a single skew element can have a large effect on performance. Finally, for examining the effect of data skew, they examine the case of 8 processors. We will show that the effect of data skew becomes more pronounced as the number of processors is increased. Obviously, more processors will be utilized in future database machines. Our paper examines cases with up to 128 processors.

...objective function. We show that the improvement in the join phase over conventional algorithms is drastic in the high skew case. In fact, the proposed algorithm is demonstrated to achieve very good load balancing for the join phase in all cases, being very robust relative to the degree of data skew and the number of processors. A Zipf-like distribution is used to model the data skew.

The sort merge join [BLAS77] and hash join [KITS83, BRAT84, DEWI85] methods are popular algorithms for computing the equijoin of two relations. In this paper we examine the sort merge join method and propose an effective way to deal with the data skew problem. In a typical parallel sort merge join, e.g. [IYER88], each of the relations is first sorted, in parallel, according to the join column. This is called the sort phase. A transfer phase follows, in which the output of the sort phase is shipped to the various processors according to some range algorithm. Finally, in the join phase, the sorted ranges are merged and joined. Each processor handles its own range of data. Conventional parallel join algorithms do not capture the effects of skew distribution in the join column. As indicated, the impact to performance can be devastating [LAKS88].

The environment and assumptions are described in Section 2. The scheduling phase algorithm is presented in Section 3. In Section 4, a sensitivity analysis is provided to demonstrate the robustness of the algorithm and the speedup in the join phase over conventional algorithms. Finally, in Section 5, we summarize the results and outline our future work. A hash join version of our parallel join algorithm has also been devised [WOLF90].

2. Environment and Assumptions:
[Figure 2.1. Data Partitioning Architecture (control processor, database processors, disks)]

To examine the speedup achievable in the join phase by the algorithm proposed in this paper, we use synthetic data for the values in the join column, based on a Zipf-like distribution [KNUT73] as follows: We assume that the domain of the join column has D distinct values. Then the probability p_i that the join column value of a particular tuple takes on the i-th value in the domain, 1 ≤ i ≤ D, is p_i = c / i^(1-θ), where c = 1 / (Σ_{i=1}^{D} 1/i^(1-θ)) is a normalization constant. We also assume that each tuple's join column value is independently chosen from this distribution. Setting the parameter θ = 0 corresponds to the pure Zipf distribution, which is highly skewed, while θ = 1 corresponds to the uniform distribution. We will use θ = 0.5 as a case of moderate skew. The Zipf-like distributions corresponding to D = 100 and θ = 0.0, .25, .5, .75 and 1.0 are shown in Figure 2.2.
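For concreteness, the distribution above can be sketched in Python. This is a hedged illustration only: the function names and seeding are ours, not from the paper.

```python
import random

def zipf_like_probs(D, theta):
    # p_i = c / i**(1 - theta), with c chosen so the p_i sum to 1;
    # theta = 0 gives the pure Zipf distribution, theta = 1 the uniform one
    weights = [1.0 / i ** (1.0 - theta) for i in range(1, D + 1)]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]

def sample_join_column(D, theta, n, seed=0):
    # draw n independent join-column values from the distribution
    rng = random.Random(seed)
    return rng.choices(range(1, D + 1), weights=zipf_like_probs(D, theta), k=n)
```

With D = 100 and θ = 1 every value has probability 1/100, while θ = 0 concentrates most of the mass on the first few values.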
In [LYNC88], data from large bibliographic databases are used to support models of skewed column value distributions based on Zipf distributions. See also [WOLF90].

...sorted run on its local disk. This sort phase could be done using an optimal tournament tree external sort as in [IYER88]. The second phase of this algorithm is a scheduling phase that attempts to split the join execution into tasks and assign tasks to the processors in an optimal manner so as to minimize the overall completion time, or makespan. It is this scheduling phase that is the crucial aspect of our paper, and the algorithm is described in detail in Section 3. The third phase is a transfer phase in which the data from different ranges of each of the sorted relations is shipped to the processor(s) assigned during the scheduling phase. Since the scheduling phase partitions the data into ranges or single distinct values (as described later), this transfer phase can be accomplished by a single pass through the data. Finally, in the join phase, the sorted ranges are read from local disk, merged and joined, and the join outputs written to disk.

As described above, the transfer phase involves an additional pass through the sorted runs of the two relations to be joined. It is possible to do this phase without extracting tuples from data blocks provided that the relative byte addresses (RBAs) of the partition boundaries are determined in the scheduling phase. This would considerably reduce the overhead of the phase. Further optimizations are possible, such as combining this transfer step with the join phase. For instance, assuming the join operation is CPU-bound, only one join task need be active on each processor at any time. This task is assigned to merge ranges (or a single value) from both relations. These ranges can be read at each processor, shipped to the assigned processor, merged with the corresponding ranges from the other processor, and joined, all in one pass through the data. The penalty of such a scheme is that each task now has a set-up overhead of starting the range reads on all the other processors. The need for additional data buffers and synchronization delays in reading data from different processors increase the complexity of such an optimization. In a data sharing environment, all processors have direct access to the disks. Therefore, the transfer phase can be eliminated for such an architecture.

Though the data values are assumed to be skewed, we assume that the partitioning function is such that the relations to be joined, R1 and R2, are more or less uniformly partitioned among the processors, i.e., each processor has a comparable number of tuples of each relation. For example, if the tuples are range partitioned on the primary key, then the ranges can be adjusted to approximately balance the number of tuples in each partition. For most skew distributions and numbers of processors, such range partitioning will lead to good balance.

3. Scheduling Phase Algorithm:

To introduce the algorithm which forms the scheduling phase of the proposed sort merge join approach, suppose that v1 ≤ v2 are two values in the domain of the join columns. Let P denote the number of processors. Given any of the 2P sorted runs created during the sort phase, for example the one corresponding to processor p ∈ {1,...,P} and relation r ∈ {1,2}, there is a well-defined (possibly empty) contiguous subset R_{p,r}^{v1,v2} consisting of all rows with sort column values in the interval [v1, v2]. Shipping each of the R_{p,r}^{v1,v2} over to a single processor for final merging and joining results in an independent task T^{1,v1,v2} of the total remaining part of the join operation. (The superscript here underscores the fact that a single processor is involved. The significance of this will become apparent shortly.) Assume that we can estimate the time it takes to perform this task, as we shall do in Section 3.4.

Given v1 ≤ v2, precisely one of two special cases may occur: either v1 < v2, or v1 = v2. We shall call a pair (v1, v2) satisfying v1 < v2 a type 1 pair. In the case where v1 = v2, the join output is just the cross-product of the two inputs. We shall call a pair (v1, v2) satisfying v1 = v2 a type 2 pair. Actually, for type 2 pairs, say with v = v1 = v2, we may wish to consider the additional possibility of partitioning...
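The range machinery above can be sketched with binary search on one sorted run. This is a hedged illustration under our own naming, not the paper's implementation.

```python
import bisect

def task_subset(sorted_keys, v1, v2):
    # contiguous (possibly empty) index range [lo, hi) of rows whose
    # join-column value lies in the interval [v1, v2]
    lo = bisect.bisect_left(sorted_keys, v1)
    hi = bisect.bisect_right(sorted_keys, v2)
    return lo, hi

def pair_type(v1, v2):
    # type 1 pair: v1 < v2 (a range of values);
    # type 2 pair: v1 == v2 (a single value, whose join output is the
    # cross-product of the two input subsets)
    return 1 if v1 < v2 else 2
```

Because each run is sorted, the subset is always a contiguous slice, which is what allows the later transfer phase to use a single pass.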
[Figure 2.2. Zipf-like distributions for D = 100 and θ = 0.00, 0.25, 0.50, 0.75, 1.00]

...a "perfect" assignment, not necessarily possible, would have each processor busy for (Σ_{n=1}^{N} Σ_{m=1}^{MULT_n} TIME_n^{MULT_n})/P units of time.) Specifically, we would like to assign each task T_n^m to a processor ASSIGN(n, m) in such a way that the completion time of the total job...
it does not appear to parallelize easily. In contrast, the Galil and Megiddo algorithm parallelizes naturally to P processors, each handling the two sets of sorted runs they created in the first place. We remark, however, that the Frederickson and Johnson algorithm may be the algorithm of choice for handling skews in sorting. (The precise three-region partitioning required for correctness in joining is not needed for sorting.) See [IYER89] for yet another selection problem algorithm applied to that particular problem. [IBAR88] contains good descriptions of the GM and other selection problem algorithms, but for completeness, and because of the slightly different use to which we put it, we shall describe GM in Section 3.2. See also [TANT88] for a computer science application of a generalization of the selection problem.

Algorithmic descriptions and notes on LPT appear in Section 3.1. Section 3.2 handles GM. Section 3.3 deals with the proposed scheduling algorithm itself, labeled SKEW. SKEW works by repeatedly switching back and forth between LPT and GM. In Section 3.4 we deal with task time estimation.

3.1. LPT:

Procedure: LPT

Input: Number of processors P, number of tasks N̂, and task times {TIME_n | n = 1,...,N̂}.

Output: A heuristic assignment of the tasks to the processors which approximately minimizes the makespan.

Sort the tasks (if necessary) in order of decreasing TIME_n.
Set TOTAL_p = 0 for each processor p.
Do for n = 1 to N̂.
    Assign task n to the processor p for which TOTAL_p is minimum. (Ties are decided in favor of smaller p.)
    Add TIME_n to TOTAL_p.
End do.
End LPT

Notes on LPT:

- The makespan in the algorithm is represented by max_p TOTAL_p.
- Considerable work has been done on analyzing the worst-case behavior of LPT (and MULTIFIT). We reiterate that the worst-case and average-case behaviors are far apart, worst-case behavior being good, and average-case behavior being excellent.
- The computational complexity of LPT is O(N̂(log N̂ + log P)). The presumably dominant term, N̂ log N̂, comes from the sorting step, for which we employ QUICKSORT. See [AHO74].

3.2. GM:

Procedure: GM

Input: For each j ∈ {1,...,J}, a column {a_ij | i = TOP_j,..., BOT_j} of non-decreasing elements, where TOP_j ≤ BOT_j are indices of the column ranges under consideration.

Output: An l-th smallest element η, and, for each j ∈ {1,...,J}, a partition of each range {TOP_j,..., BOT_j} into three new ranges, {TOP_j^1,..., BOT_j^1} with values less than η, {TOP_j^2,..., BOT_j^2} with values equal to η, and {TOP_j^3,..., BOT_j^3} with values greater than η.

Set T_j = TOP_j and B_j = BOT_j for each j ∈ {1,...,J}.
Do forever.
    Set M_j = B_j - T_j + 1 for each j ∈ {1,...,J}. Set S = Σ_{j=1}^{J} M_j.
    For each j ∈ {1,...,J}, find the median element η_j of the set {a_ij | i = T_j,..., B_j}. Sort the medians in non-decreasing order, so that η_{j_1} ≤ ... ≤ η_{j_J}. Compute the value k such that Σ_{i=1}^{k-1} M_{j_i} < S/2 and Σ_{i=1}^{k} M_{j_i} ≥ S/2. Set η = η_{j_k}.
    Compute for each j ∈ {1,...,J}, TT_j = min {i | (TOP_j ≤ i ≤ BOT_j) ∧ (a_ij = η)}, and BB_j = max {i | (TOP_j ≤ i ≤ BOT_j) ∧ (a_ij = η)}. Set M^1 = Σ_{j=1}^{J} (TT_j - T_j) and M^2 = Σ_{j=1}^{J} (BB_j - T_j + 1).
    If [M^1 < l ≤ M^2]
    Then begin
        η is an l-th smallest element. Set TOP_j^1 = TOP_j, BOT_j^1 = TT_j - 1, TOP_j^2 = TT_j, BOT_j^2 = BB_j, TOP_j^3 = BB_j + 1, and BOT_j^3 = BOT_j for each j ∈ {1,...,J}. Halt.
    End
    If [M^1 ≥ l] then set B_j = TT_j for each j ∈ {1,...,J}.
    If [M^2 < l] then decrement l by M^2 and set T_j = BB_j + 1 for each j ∈ {1,...,J}.
End do.
End GM

Notes on GM:

- We will always apply GM in the case where J = 2P, twice the number of processors. We will always be looking for the median element.
- For ease of exposition, we have purposely ignored the details of cases in which a column or region therein is (or becomes) empty. The details are somewhat messy, and not essential to understanding the algorithm.
- The TT_j and BB_j values can be found by binary search. This can be done in parallel by each of the P processors. The median elements η_j can also be found in parallel.
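As a concrete illustration, here is a minimal Python sketch of LPT, together with a brute-force reference for the selection problem that GM solves. This is hedged: it does not reproduce GM's median-of-medians iteration, and the function names are ours.

```python
import heapq

def lpt(P, times):
    # Longest Processing Time: take tasks in decreasing TIME_n order and
    # give each one to the processor with the smallest TOTAL_p so far
    totals = [0.0] * P
    assign = {}
    heap = [(0.0, p) for p in range(P)]  # (TOTAL_p, p); ties favor smaller p
    heapq.heapify(heap)
    for n in sorted(range(len(times)), key=lambda n: -times[n]):
        total, p = heapq.heappop(heap)
        assign[n] = p
        totals[p] = total + times[n]
        heapq.heappush(heap, (totals[p], p))
    return assign, max(totals)  # assignment and makespan

def lth_smallest(columns, l):
    # reference answer for GM: the l-th smallest element (1-indexed l)
    # over J sorted columns, found here by brute-force merging
    return sorted(x for col in columns for x in col)[l - 1]
```

For example, `lpt(2, [7, 5, 4, 3, 3, 2])` balances the two processors at a makespan of 12.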
3.3. SKEW:

Procedure: SKEW

Input: Number of processors P, 2P sets of sorted runs, {a_{i,p,r} | i = 1,..., CARD_{p,r}}, one for each processor p ∈ {1,...,P} and each relation r ∈ {1,2}, where CARD_{p,r} is the cardinality of the sorted run of relation r at processor p, and a_{i,p,r} is the i-th tuple in this sorted run.

Output: The creation of tasks and a heuristic assignment of those tasks to the processors which approximately minimizes the makespan.

Set the number of tasks N = 1.
Set the top and bottom of the first task to be TOP_{p,r,1} = 1 and BOT_{p,r,1} = CARD_{p,r} for each processor p = 1,...,P and each relation r = 1,2.
Determine the type (1 or 2) of the first task.
Do forever.
    Determine the optimal multiplicities MULT_n of each type 2 task n ∈ {1,...,N}. (Set MULT_n = 1 for each type 1 task n ∈ {1,...,N}.) Compute the total number of tasks to be N̂ = Σ_{n=1}^{N} MULT_n.
    Compute the task times {TIME_n^{MULT_n} | n = 1,...,N}.
    ...
    Else halt with solution from final LPT.
    End
End do.
End SKEW

[Figure 3.1. Using GM to Subdivide a Type 1 Task]
Notes on SKEW:

- In the extremely likely event that the first task created is of type 1, the task would correspond to performing the entire join phase on a single processor. If it is of type 2 instead, then the entire join is the join of a single element, so that we are forming a full cross-product of the rows of the two relations. The optimal multiplicity in this case will be determined to be P, and the algorithm will halt with an essentially perfect solution.
- In general, the optimal multiplicity for a type 2 task n will be that m with 1 ≤ m ≤ P and task time TIME_n^m with the smallest total time m·TIME_n^m subject to the constraints that TIME_n^m ≤ (m·TIME_n^m + REST)/P, where REST is the combined time of all other tasks, and that TIME_n^m ≥ MINTIME, where MINTIME, an input variable, is the largest size task which SKEW is not allowed to subdivide. The first constraint has the effect of requiring m to be greater than some minimum value, while the second constraint has the opposite effect. The first constraint ensures that each individual task must fit within one of the P processors. MINTIME is used to guard against splitting tasks too finely. (We do not model task initiation times explicitly, but by properly setting MINTIME, we have the same effect. In fact, the algorithm could be made to "throw away" the smallest tasks it creates, by coalescing them with one of their neighbors.) Since the question of whether a type 2 task fits or not depends on the multiplicities of the other type 2 tasks, we cycle through the type 2 tasks in order of size, determining optimal multiplicities, and then repeat the process until the multiplicities remain stable throughout a complete cycle.
- The solution can be unacceptable for several reasons. The most obvious is that the quality of the LPT solution is not within some input variable TOLERANCE. (If SOL_LPT denotes the makespan of the LPT solution, and SOL_PERFECT denotes the makespan of a "perfect" solution, the quality of the LPT solution will be acceptable if (SOL_LPT - SOL_PERFECT)/SOL_LPT < TOLERANCE.) However, the following reasons for failure are also valid: First, it may happen early on in the algorithm that N̂ < P, in which case LPT is not even called. Second, it may happen that the time TIME_n^1 of the largest type 1 task n may satisfy TIME_n^1 ≤ MINTIME. Finally, it may happen that the number N of tasks already created may satisfy N ≥ MAXT, where MAXT is an input variable designed to keep the algorithm from running too long. Generally, setting MAXT to be on the order of 10 times the number of processors proves quite satisfactory.
- We again use QUICKSORT to perform the sorting.

Figure 3.1 shows a type 1 task being subdivided into three new tasks. The entire 2P sets of sorted runs are shown, with the old type 1 task labeled with 1s, 2s and 3s. The new type 2 task corresponding to the median element of the old task is labeled with 2s. The other two new tasks are labeled with 1s and 3s, respectively. These latter two tasks may be of type 1, in which case they may be candidates for subdivision themselves at some further point in the algorithm.

3.4. Task time estimation:

In this section we derive the task time estimation formulas. To begin with, assume that we have a type 2 task T_n^v of multiplicity MULT. (The formulas to handle type 1 tasks will be based on the type 2 formula.) Let K1 and K2 denote the sizes (measured in blocks) of the two sets of tuples in relations R1 and R2, respectively, which correspond to the value v. For ease of exposition, let us assume that K1 is the larger of the two. (The formulas will merely need to be switched if the reverse is true.) Suppose that S is the memory buffer size (also in blocks) for each processor.

We can either split K1 into MULT equal parts, or we can split K2 into MULT equal parts. Let us label these as Methods 1 and 2, respectively. We will ultimately pick the method which gives the lowest task time. Whichever method we employ, we will let the larger component correspond to the outer loop, and the smaller component correspond to the inner loop. This is provably better than the reverse. The component corresponding to the outer loop will be allocated 1 block in the memory buffer, while the component corresponding to the inner loop will use the remaining S - 1 blocks. The blocks of the inner loop component cycle through the memory buffer once for each block of the outer loop component, in an alternatingly forwards and backwards manner. (This approach might accordingly be dubbed the ZIGZAG algorithm.) We thus utilize the memory in a way which minimizes the total number of blocks that need to be read. Let γ = K2/K1. By our convention, 0 ≤ γ ≤ 1.

Method 1: In this case, it is not apparent which of the two values, K1/MULT or K2 = γK1, is larger. So we let min = min(1/MULT, γ), and max = max(1/MULT, γ). A simple analysis then yields a time per processor of

    A[min·max·K1² + 2·max·K1 - max·K1·S + S - 1] + B·min·max·K1²

if min·K1 ≥ S - 1, and a time per processor of

    A[max·K1 + min·K1] + B·min·max·K1²

otherwise. Here, A is a coefficient which equals the per block pathlength overhead of reading in the data, extracting the tuples, merging the sorted runs, and performing the join comparison. B is a coefficient which equals the pathlength overhead of inserting the output tuples generated by joining one block of tuples (with identical join column values) from each of relations R1 and R2 into an output file and writing out the data. The second expression corresponds to
the case where the smaller component fits in the memory buffer, while the first expression corresponds to the case where it does not.

[Figure: Splitting a Type 2 Task Among Multiple Processors]
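The block-read count behind the first Method 1 expression can be checked with a small sketch. This is our naming, and it assumes the inner component does not fit in the S - 1 buffer blocks.

```python
def zigzag_blocks_read(outer, inner, S):
    # the outer-loop component gets 1 buffer block; the inner-loop
    # component cycles through the remaining S - 1 blocks, reversing
    # direction on each outer block so S - 1 inner blocks are reused free
    assert inner >= S - 1
    first_pass = inner                           # inner read once in full
    repasses = (outer - 1) * (inner - (S - 1))   # evicted blocks re-read
    return outer + first_pass + repasses
```

Expanding the return value gives outer·inner + 2·outer - outer·S + S - 1, which with outer = max·K1 and inner = min·K1 matches the bracketed term multiplying A in the first Method 1 expression.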
...overhead B in Section 3). We vary the number of processors from 1 to 128, and use combinations of θ values of 0 (pure Zipf for the highly skewed case), 0.5 (moderate skew) and 1 (uniform). Finally, the correlation between the specific skewed values in the two relations is modeled as follows: The D distinct values of relation R1 are arranged in descending order of the number of tuples that have this value in their join column. The correlation is modeled using a single parameter C that takes on integer values from 1 to D. Then, corresponding to the descending ordering of relation R1, the value in R2 with the largest number of tuples is placed in a position chosen randomly from 1 to C. The next most frequent value of R2 is placed in a randomly chosen position from 1 to C + 1, except that the position occupied by the previous step is not allowed, and so on. Thus C = 1 corresponds to perfect correlation, and C = D to the random case. We choose C = 500 for our comparisons, which corresponds to a moderate to high degree of correlation. Preliminary examination of potential join columns in some actual databases supports such a degree of correlation.

[Figure 4.1. Speedup of the Join Phase - High Skew Case (THETA1 = 0; THETA2 = 0)]
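The placement model can be sketched as follows. This is a hedged illustration; the function name and seeding are ours.

```python
import random

def correlate_positions(D, C, seed=0):
    # place the k-th most frequent value of R2 (k = 1,...,D) into a
    # random, not-yet-occupied position in 1..min(C + k - 1, D)
    rng = random.Random(seed)
    occupied, positions = set(), []
    for k in range(1, D + 1):
        limit = min(C + k - 1, D)
        p = rng.choice([q for q in range(1, limit + 1) if q not in occupied])
        occupied.add(p)
        positions.append(p)
    return positions
```

C = 1 forces the identity placement (perfect correlation), while C = D allows any permutation (the random case).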
We compare the speedups obtained using the proposed algorithm with two heuristics. In the first heuristic, the number of distinct values in the join column values is divided into P range partitions, each with (approximately) the same number of distinct values, and each partition is assigned to one of the P processors. Then tuples from each relation are shipped to the assigned processor and are merged and joined in the final phase. Intuitively, merely dividing the distinct values without regard to the number of tuples with the same value can be expected to lead to poor speedups when the data is highly skewed. For the speedups using this heuristic reported below, the values are idealistic because perfect knowledge of the distinct values in the join column is used in the assignment. In an implementation the heuristic could be approximated by equally dividing the range between the minimum and maximum values of each relation, as determined during the sort phase. Another simple scheme that approximates this heuristic is to use a (uniform) hash partitioning that assigns a distinct value to a particular processor depending on the result of a hash function of the value [DEWI87]. However, the results show that even with perfect information, this (naive) heuristic has poor performance in the presence of moderate to high data skew.

The second heuristic used for comparison purposes partitions the two relations into P ranges such that the sum of the number of tuples in each range from the union of both relations to be joined is (approximately) 1/P of the total number of tuples in both relations, and each range is assigned to one of the P processors. The range partitioning can be done using the parallelized version of the Galil-Megiddo algorithm outlined in Section 3. This method is similar to that in [AKL87] in the context of merging two lists, where an algorithm is proposed that breaks each of the two lists into two ranges such that the sum of the number of elements in the first range of the two lists is half the sum of the total number of elements in both lists. A similar algorithm is proposed in [IYER88].

A word about the methodology employed in computing the speedup is in order: We are using the "actual" makespan of the SKEW algorithm rather than the estimated makespan of Section 3. This means that we plug in the actual distributions of values into the formulas of Section 3, but we employ the methods (1 or 2) determined by our estimates, whether they be right or wrong. The actual task times for type 1 elements will thus be higher, in general, than the times obtained if full knowledge of the distribution of values were known beforehand.

We first consider the case where both the relations have highly skewed distributions of values in the join column. This case corresponds to θ1 = θ2 = 0, i.e., pure Zipf distributions for both relations. Figure 4.1 shows the number of processors versus the speedup of the join phase for this case for the proposed scheme and the two heuristics outlined above. In this context, the speedup is the ratio of the CPU time to complete the join phase on one processor to the (actual makespan of the) time on P processors. The figure shows a close to linear speedup for the proposed algorithm, and small speedup for both the heuristics. The same data is displayed as a normalized speedup in Figure 4.2. The normalized speedup for the join phase is defined as the ratio of the speedup to the number of processors. Therefore, a normalized speedup of unity represents the ideal case of perfect speedup. The figure shows that for the proposed scheme, the normalized speedup is for the most part greater than 0.9, and usually close to unity. Virtually the entire reason for the departure of the normalized speedup from unity is the difference between the estimated CPU run time for a task and the actual run time. (LPT never gave a bad solution to the minimum makespan problem.) This
[Figure 4.2. Normalized Speedup of the Join Phase - High Skew Case]
[Figure 4.3. Number of Largest Type 2 Pairs Assigned to Tasks for the High Skew Case]
discrepancy, in turn, is caused by our simplistic assumption of a uniform distribution within a range in estimating the time for the join operations in a type 1 task. As described in Section 3, a stopping condition for the algorithm is the creation of a fixed number of tasks per processor. Therefore, for a small number of processors, the number of iterations in the algorithm is small. There are some type 1 tasks that have a sizeable skew, giving rise to a discrepancy between the estimated and actual run times. As the number of processors increases, so do the number of iterations. Therefore, the estimates of run times get better, leading to better speedups. We expect that better methods of estimating the times of a type 1 task will further improve the speedups using the proposed method. Improved estimation techniques have been devised and will be reported on later. This argument is supported by the bar chart in Figure 4.3. To understand this chart, suppose that all the potential type 2 pairs are ordered by task times. Then Figure 4.3 shows, for each number of processors, the number of the largest potential type 2 pairs that were identified by the algorithm and assigned to separate tasks before a miss occurs -- in other words, the next largest potential type 2 pair was not identified, and occurred as part of a type 1 task instead. The chart shows that the number of large type 2 pairs created by the algorithm increases quickly as the number of processors increases. Therefore, the estimates for the run time of the remaining type 1 tasks improves with the number of processors. Notice from Figure 4.2 the sudden improvement in the normalized speedup in going from 8 to 16 processors. The reason for this behaviour can be seen from Figure 4.3, where the number of the largest type 2 pairs assigned separate tasks increases from 6 to 19 with this change. This leads to a significantly better estimate for the type 1 tasks, and therefore to the large improvement in the normalized speedup. The multiplicities of the five largest type 2 pairs as a function of the number of processors is shown in the bar chart of Figure 4.4. Note that while the multiplicities increase with the number of processors, they are still much smaller than the total number of processors available.

Returning to Figure 4.2, we note that the heuristics do poorly for this case because of the high data skew. For the first heuristic, some partitions have a disproportionately large number of tuples, leading to long run times for the processor that is assigned the partition. This effect becomes worse as the number of processors increases because the run time becomes dominated by a few partitions. For the second heuristic, though the number of tuples in each partition is the same, the join output begins to dominate for processors that are assigned the large skew values. Eventually, the largest skew elements determine the makespan of this heuristic, and the speedup converges to that of the first heuristic.

Figure 4.5 shows the normalized speedup for the three algorithms for the case of a high skew on relation R1 (θ1 = 0) and a medium skew on relation R2 (θ2 = 0.5). Again, the normalized speedup for the proposed scheme is close to 0.9 for up to 128 processors, for the same reasons as given above. The first heuristic shows no improvement over the previous case. The second heuristic shows some improvement...

[Figure 4.4. Multiplicity of 5 Largest Type 2 Pairs for the High Skew Case]
[Figure 4.5. Normalized Speedup of the Join Phase - Relation 1 (2) high (medium) Skew]
[Figure 4.6. Normalized Speedup of the Join Phase - Relation 1 (2) high (no) Skew]
[Figure 4.7. Normalized Speedup of the Join Phase - Relation 1 (2) medium (medium) Skew]
[Figure 4.8. Normalized Speedup of the Join Phase - Relation 1 (2) medium (no) Skew]
relation R2. For this single (low) skew case, the speedup from the proposed scheme is almost ideal, and the two heuristics are much improved. Even so, the speedup of the proposed scheme for 128 processors is about twice that of the first heuristic, and about a third better than the second heuristic.

Finally, we should point out that even our "actual" task times are really stochastic rather than deterministic in nature. Recent studies in [LAKS89a,b] have shown that the speedup achievable through horizontal growth can be quite sensitive to such variations in task times. One simple way to deal with this issue within the general context of our proposed algorithm is to force the processors to execute their tasks in the same order (largest first) in which they were assigned by LPT. During the course of the join phase the processors could report their progress. If the quality of the SKEW solution degrades past a certain predetermined threshold a new LPT algorithm could be initiated to handle the tasks remaining. Obviously, one would have to modify the timing of the transfer phase somewhat to allow for such a scheme. Slightly more elaborate approaches could also be devised. For example, the type 2 tasks are more deterministic than the type 1 tasks, so some of the smaller of these type 2 tasks might be executed last.

5. Summary:

Conventional parallel join algorithms perform poorly in the presence of data skew. In this paper, we propose a parallel sort merge join algorithm which can effectively handle the data skew problem. The proposed algorithm introduces a scheduling phase in addition to the usual sort, transfer and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. Two basic optimization techniques are employed repeatedly. One solves the selection problem, while the other heuristically solves the minimum makespan problem. Our approach naturally identifies the largest skew elements and assigns each of them to an optimal number of processors. The algorithm is demonstrated to achieve very good load balancing for the join phase in a CPU-bound environment and to be very robust relative to the degree of data skew and the number of processors. A Zipf-like distribution is used to model the data skew. Although we assume that the system is CPU-bound and thus that a CPU pathlength estimate be used in the objective function, we expect other environments can be similarly handled merely by employing a different objective function.

A hash join version of our parallel join algorithm has also been devised [WOLF90]. We are in the process of testing our algorithm against real databases. We have also devised improved type 1 task time estimation techniques, which substantially enhance the speedups and reduce the overheads due to SKEW, and will be reported on later.

References:

AHO74 Aho, A.V., Hopcroft, J.E. and Ullman, J.D. (1974) The Design and Analysis of Computer Algorithms, Addison-Wesley.
AKL87 Akl, S.G. and Santoro, N. (1987) "Optimal Parallel Merging and Sorting Without Memory Conflicts," IEEE Trans. on Computers, Vol. C-36, No. 11, 1367-1369.
BIC85 Bic, L. and Hartmann, R.L. (1985) "Hither Hundreds of Processors in a Database Machine," Proceedings of the 1985 International Workshop on Database Machines, Springer Verlag.
BLAS77 Blasgen, M. and Eswaran, K. (1977) "Storage and Access in Relational Databases," IBM Systems Journal, Vol. 4, p. 363.
BLUM72 Blum, M., Floyd, R.W., Pratt, V.R., Rivest, R.L. and Tarjan, R.E. (1972) "Time Bounds for Selection," Journal of Computer and System Sciences, Vol. 7, 448-461.
BRAT84 Bratsbergsengen, K. (1984) "Hashing Methods and Relational Algebra Operations," Proceedings of the 10th International Conference on Very Large Databases.
CHRI83 Christodoulakis, S. (1983) "Estimating Record Selectivities," Information Systems, Vol. 8, No. 2, 105-115.
COFF73 Coffman, E. and Denning, P.J. (1973) Operating Systems Theory, Prentice Hall.
COFF78 Coffman, E., Garey, M. and Johnson, D.S. (1978) "An Application of Bin Packing to Multiprocessor Scheduling," SIAM Journal of Computing, Vol. 7, 1-17.
CORN86 Cornell, D.W., Dias, D.M. and Yu, P.S. (1986) "On Multi-System Coupling through Function Request Shipping," IEEE Trans. Software Eng., Vol. SE-12, No. 10, 1006-1017.
DEMU85 Demurjian, S.A., Hsiao, D.K., Kerr, D.S., Menon, J., Strawser, P.R., Tekampe, R.C., Trimble, J. and Watson, R.J. (1985) "Performance Evaluation of a Database System in Multiple Backend Configurations," Proceedings of the 1985 International Workshop on Database Machines, Springer Verlag.
DEWI85 DeWitt, D.J. and Gerber, R.H. (1985) "Multiprocessor Hash-Based Join Algorithms," Proceedings of the 11th International Conference on Very Large Databases.
DEWI86 DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B. and Muralikrishna, M. (1986) "GAMMA - A High Performance Dataflow Database Machine," Proceedings of the 12th International Conference on Very Large Databases.
DEWI87 DeWitt, D.J., Smith, M. and Boral, H. (1987) "A Single-User Performance Evaluation of the Teradata Database Machine," MCC Technical Report 88-081-87.
FRED82 Frederickson, G. and Johnson, D.B. (1982) "The Complexity of Selection and Ranking in X + Y and Matrices with Sorted Columns," Journal of Computer and System Sciences, Vol. 24, 197-209.
FRED84 Frederickson, G. and Johnson, D.B. (1984) "Generalized Selection and Ranking: Sorted Matrices," SIAM Journal of Computing, Vol. 13, 14-30.
GALI79 Galil, Z. and Megiddo, N. (1979) "A Fast Selection Algorithm and the Problem of Optimum Distribution of Effort," Journal of the ACM, Vol. 26, 58-64.
GRAH69 Graham, R. (1969) "Bounds on Multiprocessing Timing Anomalies," SIAM Journal of Appl. Math., Vol. 17, No. 2, 416-429.
HSIA83 Hsiao, D.K. (1983) Advanced Database Machine Archi-
estimation techniques, and are examining methods to speed up the tecture, Prentice-Hall.
lBAR88 Ibamki, T. and Katoh, N. (1988) Resource Allocation
scheduling phase. Relatively small changes to our current approach Problems. MIT Press.
114
IYER88 Iyer B.R. and Dias D.M. (1988) "System Issues in Parallel Sorting for Database Systems," IBM Research Report RJ 6585.
IYER89 Iyer B.R., Ricard G.R. and Varman P.J. (1989) "Percentile Finding Algorithm for Multiple Sorted Runs," Proceedings of the 15th International Conference on Very Large Databases.
KITS83 Kitsuregawa M., Tanaka H. and Moto-oka T. (1983) "Application of Hash to Data Base Machine and its Architecture," New Generation Computing, Vol. 1, No. 1.
KNUT73 Knuth D.E. (1973) The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley.
LAKS88 Lakshmi S. and Yu P.S. (1988) "Effect of Skew on Join Performance in Parallel Architectures," Proceedings Intl. Symposium on Databases in Parallel and Distributed Systems.
LAKS89a Lakshmi S. and Yu P.S. (1989) "Limiting Factors of Join Performance on Parallel Processors," Proceedings of the 5th Intl. Conf. on Data Engineering.
LAKS89b Lakshmi S. and Yu P.S. (1989) "Analysis of Parallel Processing Architectures for Database System," Proceedings of the 1989 Intl. Conf. on Parallel Processing.
LYNC88 Lynch C.A. (1988) "Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distributions of Column Values," Proceedings of the 14th International Conference on Very Large Databases.
MONT83 Montgomery A.Y., D'Souza D.J. and Lee S.B. (1983) "The Cost of Relational Algebraic Operations on Skewed Data: Estimates and Experiments," Information Processing 83, IFIP.
NECH84 Neches P.M. and Shemer J.E. (1984) "The Genesis of a Database Computer," IEEE Computer, Vol. 17, No. 11, 42 - 56.
OZKA86 Ozkarahan E. (1986) Database Machines and Database Management, Prentice Hall.
QADA85 Qadah G.Z. (1985) "The Equi-Join Operation on a Multiprocessor Database Machine: Algorithms and the Evaluation of their Performance," Proceedings of the 1985 International Workshop on Database Machines, Springer Verlag, 35 - 67.
SALZ83 Salza S., Terranova M. and Velardi P. (1983) "Performance Modeling of the DBMAC Architecture," Proceedings of the 1983 International Workshop on Database Machines, Springer-Verlag, 74 - 90.
SCHN89 Schneider D.A. and DeWitt D.J. (1989) "A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment," ACM Sigmod Conference, Portland, Oregon, 110 - 121.
STON86 Stonebraker M. (1986) "The Case for Shared Nothing," IEEE Database Engineering, Vol. 9, No. 1.
TANT88 Tantawi A.N., Towsley D. and Wolf J. (1988) "Optimal Allocation of Multiple Class Resources in Computer Systems," ACM Sigmetrics Conference, Santa Fe, New Mexico, 253 - 260.
VALD84 Valduriez P. and Gardarin G. (1984) "Join and Semi-Join Algorithms for a Multiprocessor Database Machine," ACM Transactions on Database Systems, Vol. 9, No. 1, 133 - 161.
WOLF90 Wolf J.L., Dias D.M. and Yu P.S. (1990) "An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew," IBM Research Report RC 15510.
YU87 Yu P.S., Dias D.M., Robinson J.T., Iyer B.R. and Cornell D.W. (1987) "On Coupling Multi-Systems Through Data Sharing," Proceedings of the IEEE, Vol. 75, No. 5, 573 - 587.
ZIPF49 Zipf G.K. (1949) Human Behavior and the Principle of Least Effort, Addison-Wesley.
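The summary notes that the scheduling phase repeatedly applies a heuristic for the minimum makespan problem, in the tradition of Graham's bounds [GRAH69]. As a rough illustration only, and not the paper's actual scheduling algorithm, the sketch below implements the classic longest-processing-time (LPT) rule; the task times, processor count, and function name are hypothetical examples.

```python
# Illustrative sketch: Graham's LPT (longest-processing-time) heuristic for
# the minimum makespan problem. Hypothetical example, not the paper's
# scheduling-phase algorithm.
import heapq

def lpt_schedule(task_times, num_processors):
    """Assign each task to the currently least-loaded processor, taking
    tasks in decreasing order of processing time (the LPT order).
    Returns (makespan, assignment), where assignment[p] lists the task
    indices placed on processor p."""
    # Min-heap of (current_load, processor_id) pairs.
    heap = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_processors)]
    for i in sorted(range(len(task_times)), key=lambda i: -task_times[i]):
        load, p = heapq.heappop(heap)      # least-loaded processor so far
        assignment[p].append(i)
        heapq.heappush(heap, (load + task_times[i], p))
    makespan = max(load for load, _ in heap)
    return makespan, assignment

if __name__ == "__main__":
    # A skewed workload: one large "skew element" plus several small tasks.
    times = [10.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
    makespan, assignment = lpt_schedule(times, 3)
    print(makespan)  # -> 10.0
```

On this skewed example LPT alone yields a makespan of 10.0 against an ideal of roughly 7.3, because an unsplittable large task dominates one processor. This is precisely the imbalance that motivates the paper's step of identifying the largest skew elements and assigning each of them to multiple processors.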