An Effective Algorithm For Parallelizing
CH2895-1/90/0000/0103$01.00 © 1990 IEEE
even a single skew element can have a large effect on performance. Finally, for examining the effect of data skew, they examine the case of 8 processors. We will show that the effect of data skew becomes more pronounced as the number of processors is increased. Obviously, more processors will be utilized in future database machines. Our paper examines cases with up to 128 processors.

...objective function. We show that the improvement in the join phase over conventional algorithms is drastic in the high skew case. In fact, the proposed algorithm is demonstrated to achieve very good load balancing for the join phase in all cases, being very robust relative to the degree of data skew and the number of processors. A Zipf-like distribution is used to model the data skew.

The sort merge join [BLAS77] and hash join [KITS83, BRAT84, DEWI85] methods are popular algorithms for computing the equijoin of two relations. In this paper we examine the sort merge join method and propose an effective way to deal with the data skew problem. In a typical parallel sort merge join, e.g. [IYER88], each of the relations is first sorted, in parallel, according to the join column. This is called the sort phase. A transfer phase follows, in which the output of the sort phase is shipped to the various processors according to some range algorithm. Finally, in the join phase, the sorted ranges are merged and joined. Each processor handles its own range of data. Conventional parallel join algorithms do not capture the effects of skew distribution in the join column. As indicated, the impact to performance can be devastating [LAKS88].

The environment and assumptions are described in Section 2. The scheduling phase algorithm is presented in Section 3. In Section 4, a sensitivity analysis is provided to demonstrate the robustness of the algorithm and the speedup in the join phase over conventional algorithms. Finally, in Section 5, we summarize the results and outline our future work. A hash join version of our parallel join algorithm has also been devised [WOLF90].

2. Environment and Assumptions:
[Figure 2.1. Data Partitioning Architecture (control processor, database processors, disks)]

To examine the speedup achievable in the join phase by the algorithm proposed in this paper, we use synthetic data for the values in the join column, based on a Zipf-like distribution [KNUT73] as follows: We assume that the domain of the join column has D distinct values. Then the probability p_i that the join column value of a particular tuple takes on the i-th value in the domain, 1 ≤ i ≤ D, is p_i = c / i^(1-θ), where c = 1 / (Σ_{i=1}^{D} 1/i^(1-θ)) is a normalization constant. We also assume that each tuple's join column value is independently chosen from this distribution. Setting the parameter θ = 0 corresponds to the pure Zipf distribution, which is highly skewed, while θ = 1 corresponds to the uniform distribution. We will use θ = 0.5 as a case of moderate skew. The Zipf-like distributions corresponding to D = 100 and θ = 0.0, .25, .5, .75 and 1.0 are shown in Figure 2.2.
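For concreteness, the distribution above can be sketched in Python. This is a hedged illustration only: the function names and seeding are ours, not from the paper.

```python
import random

def zipf_like_probs(D, theta):
    # p_i = c / i**(1 - theta), with c chosen so the p_i sum to 1;
    # theta = 0 gives the pure Zipf distribution, theta = 1 the uniform one
    weights = [1.0 / i ** (1.0 - theta) for i in range(1, D + 1)]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]

def sample_join_column(D, theta, n, seed=0):
    # draw n independent join-column values from the distribution
    rng = random.Random(seed)
    return rng.choices(range(1, D + 1), weights=zipf_like_probs(D, theta), k=n)
```

With D = 100 and θ = 1 every value has probability 1/100, while θ = 0 concentrates most of the mass on the first few values.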
In [LYNC88], data from large bibliographic databases are used to support models of skewed column value distributions based on Zipf distributions. See also [WOLF90].

...sorted run on its local disk. This sort phase could be done using an optimal tournament tree external sort as in [IYER88]. The second phase of this algorithm is a scheduling phase that attempts to split the join execution into tasks and assign tasks to the processors in an optimal manner so as to minimize the overall completion time, or makespan. It is this scheduling phase that is the crucial aspect of our paper, and the algorithm is described in detail in Section 3. The third phase is a transfer phase in which the data from different ranges of each of the sorted relations is shipped to the processor(s) assigned during the scheduling phase. Since the scheduling phase partitions the data into ranges or single distinct values (as described later), this transfer phase can be accomplished by a single pass through the data. Finally, in the join phase, the sorted ranges are read from local disk, merged and joined, and the join outputs written to disk.

As described above, the transfer phase involves an additional pass through the sorted runs of the two relations to be joined. It is possible to do this phase without extracting tuples from data blocks provided that the relative byte addresses (RBAs) of the partition boundaries are determined in the scheduling phase. This would considerably reduce the overhead of the phase. Further optimizations are possible, such as combining this transfer step with the join phase. For instance, assuming the join operation is CPU-bound, only one join task need be active on each processor at any time. This task is assigned to merge ranges (or a single value) from both relations. These ranges can be read at each processor, shipped to the assigned processor, merged with the corresponding ranges from the other processor, and joined, all in one pass through the data. The penalty of such a scheme is that each task now has a set-up overhead of starting the range reads on all the other processors. The need for additional data buffers and synchronization delays in reading data from different processors increase the complexity of such an optimization. In a data sharing environment, all processors have direct access to the disks. Therefore, the transfer phase can be eliminated for such an architecture.

Though the data values are assumed to be skewed, we assume that the partitioning function is such that the relations to be joined, R1 and R2, are more or less uniformly partitioned among the processors, i.e., each processor has a comparable number of tuples of each relation. For example, if the tuples are range partitioned on the primary key, then the ranges can be adjusted to approximately balance the number of tuples in each partition. For most skew distributions and numbers of processors, such range partitioning will lead to good balance.

3. Scheduling Phase Algorithm:

To introduce the algorithm which forms the scheduling phase of the proposed sort merge join approach, suppose that v1 ≤ v2 are two values in the domain of the join columns. Let P denote the number of processors. Given any of the 2P sorted runs created during the sort phase, for example the one corresponding to processor p ∈ {1,...,P} and relation r ∈ {1,2}, there is a well-defined (possibly empty) contiguous subset R_{p,r}^{v1,v2} consisting of all rows with sort column values in the interval [v1, v2]. Shipping each of the R_{p,r}^{v1,v2} over to a single processor for final merging and joining results in an independent task T^{1,v1,v2} of the total remaining part of the join operation. (The superscript here underscores the fact that a single processor is involved. The significance of this will become apparent shortly.) Assume that we can estimate the time it takes to perform this task, as we shall do in Section 3.4.

Given v1 ≤ v2, precisely one of two special cases may occur: either v1 < v2, or v1 = v2. We shall call a pair (v1, v2) satisfying v1 < v2 a type 1 pair. In the case where v1 = v2, the join output is just the cross-product of the two inputs. We shall call a pair (v1, v2) satisfying v1 = v2 a type 2 pair. Actually, for type 2 pairs, say with v = v1 = v2, we may wish to consider the additional possibility of partitioning...
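The range machinery above can be sketched with binary search on one sorted run. This is a hedged illustration under our own naming, not the paper's implementation.

```python
import bisect

def task_subset(sorted_keys, v1, v2):
    # contiguous (possibly empty) index range [lo, hi) of rows whose
    # join-column value lies in the interval [v1, v2]
    lo = bisect.bisect_left(sorted_keys, v1)
    hi = bisect.bisect_right(sorted_keys, v2)
    return lo, hi

def pair_type(v1, v2):
    # type 1 pair: v1 < v2 (a range of values);
    # type 2 pair: v1 == v2 (a single value, whose join output is the
    # cross-product of the two input subsets)
    return 1 if v1 < v2 else 2
```

Because each run is sorted, the subset is always a contiguous slice, which is what allows the later transfer phase to use a single pass.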
[Figure 2.2. Zipf-like distributions for D = 100 and θ = 0.00, 0.25, 0.50, 0.75, 1.00]

...a "perfect" assignment, not necessarily possible, would have each processor busy for (Σ_{n=1}^{N} Σ_{m=1}^{MULT_n} TIME_n^{MULT_n})/P units of time.) Specifically, we would like to assign each task T_n^m to a processor ASSIGN(n, m) in such a way that the completion time of the total job...
it does not appear to parallelize easily. In contrast, the Galil and Megiddo algorithm parallelizes naturally to P processors, each handling the two sets of sorted runs they created in the first place. We remark, however, that the Frederickson and Johnson algorithm may be the algorithm of choice for handling skews in sorting. (The precise three-region partitioning required for correctness in joining is not needed for sorting.) See [IYER89] for yet another selection problem algorithm applied to that particular problem. [IBAR88] contains good descriptions of the GM and other selection problem algorithms, but for completeness, and because of the slightly different use to which we put it, we shall describe GM in Section 3.2. See also [TANT88] for a computer science application of a generalization of the selection problem.

Algorithmic descriptions and notes on LPT appear in Section 3.1. Section 3.2 handles GM. Section 3.3 deals with the proposed scheduling algorithm itself, labeled SKEW. SKEW works by repeatedly switching back and forth between LPT and GM. In Section 3.4 we deal with task time estimation.

3.1. LPT:

Procedure: LPT

Input: Number of processors P, number of tasks N̂, and task times {TIME_n | n = 1,...,N̂}.

Output: A heuristic assignment of the tasks to the processors which approximately minimizes the makespan.

Sort the tasks (if necessary) in order of decreasing TIME_n.
Set TOTAL_p = 0 for each processor p.
Do for n = 1 to N̂.
    Assign task n to the processor p for which TOTAL_p is minimum. (Ties are decided in favor of smaller p.)
    Add TIME_n to TOTAL_p.
End do.
End LPT

Notes on LPT:

- The makespan in the algorithm is represented by max_p TOTAL_p.
- Considerable work has been done on analyzing the worst-case behavior of LPT (and MULTIFIT). We reiterate that the worst-case and average-case behaviors are far apart, worst-case behavior being good, and average-case behavior being excellent.
- The computational complexity of LPT is O(N̂(log N̂ + log P)). The presumably dominant term, N̂ log N̂, comes from the sorting step, for which we employ QUICKSORT. See [AHO74].

3.2. GM:

Procedure: GM

Input: For each j ∈ {1,...,J}, a column {a_ij | i = TOP_j,..., BOT_j} of non-decreasing elements, where TOP_j ≤ BOT_j are indices of the column ranges under consideration.

Output: An l-th smallest element η, and, for each j ∈ {1,...,J}, a partition of each range {TOP_j,..., BOT_j} into three new ranges, {TOP_j^1,..., BOT_j^1} with values less than η, {TOP_j^2,..., BOT_j^2} with values equal to η, and {TOP_j^3,..., BOT_j^3} with values greater than η.

Set T_j = TOP_j and B_j = BOT_j for each j ∈ {1,...,J}.
Do forever.
    Set M_j = B_j - T_j + 1 for each j ∈ {1,...,J}. Set S = Σ_{j=1}^{J} M_j.
    For each j ∈ {1,...,J}, find the median element η_j of the set {a_ij | i = T_j,..., B_j}. Sort the medians in non-decreasing order, so that η_{j_1} ≤ ... ≤ η_{j_J}. Compute the value k such that Σ_{i=1}^{k-1} M_{j_i} < S/2 and Σ_{i=1}^{k} M_{j_i} ≥ S/2. Set η = η_{j_k}.
    Compute for each j ∈ {1,...,J}, TT_j = min {i | (TOP_j ≤ i ≤ BOT_j) ∧ (a_ij = η)}, and BB_j = max {i | (TOP_j ≤ i ≤ BOT_j) ∧ (a_ij = η)}. Set M^1 = Σ_{j=1}^{J} (TT_j - T_j) and M^2 = Σ_{j=1}^{J} (BB_j - T_j + 1).
    If [M^1 < l ≤ M^2]
    Then begin
        η is an l-th smallest element. Set TOP_j^1 = TOP_j, BOT_j^1 = TT_j - 1, TOP_j^2 = TT_j, BOT_j^2 = BB_j, TOP_j^3 = BB_j + 1, and BOT_j^3 = BOT_j for each j ∈ {1,...,J}. Halt.
    End
    If [M^1 ≥ l] then set B_j = TT_j for each j ∈ {1,...,J}.
    If [M^2 < l] then decrement l by M^2 and set T_j = BB_j + 1 for each j ∈ {1,...,J}.
End do.
End GM

Notes on GM:

- We will always apply GM in the case where J = 2P, twice the number of processors. We will always be looking for the median element.
- For ease of exposition, we have purposely ignored the details of cases in which a column or region therein is (or becomes) empty. The details are somewhat messy, and not essential to understanding the algorithm.
- The TT_j and BB_j values can be found by binary search. This can be done in parallel by each of the P processors. The median elements η_j can also be found in parallel.
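As a concrete illustration, here is a minimal Python sketch of LPT, together with a brute-force reference for the selection problem that GM solves. This is hedged: it does not reproduce GM's median-of-medians iteration, and the function names are ours.

```python
import heapq

def lpt(P, times):
    # Longest Processing Time: take tasks in decreasing TIME_n order and
    # give each one to the processor with the smallest TOTAL_p so far
    totals = [0.0] * P
    assign = {}
    heap = [(0.0, p) for p in range(P)]  # (TOTAL_p, p); ties favor smaller p
    heapq.heapify(heap)
    for n in sorted(range(len(times)), key=lambda n: -times[n]):
        total, p = heapq.heappop(heap)
        assign[n] = p
        totals[p] = total + times[n]
        heapq.heappush(heap, (totals[p], p))
    return assign, max(totals)  # assignment and makespan

def lth_smallest(columns, l):
    # reference answer for GM: the l-th smallest element (1-indexed l)
    # over J sorted columns, found here by brute-force merging
    return sorted(x for col in columns for x in col)[l - 1]
```

For example, `lpt(2, [7, 5, 4, 3, 3, 2])` balances the two processors at a makespan of 12.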
3.3. SKEW:

Procedure: SKEW

Input: Number of processors P, 2P sets of sorted runs, {a_{i,p,r} | i = 1,..., CARD_{p,r}}, one for each processor p ∈ {1,...,P} and each relation r ∈ {1,2}, where CARD_{p,r} is the cardinality of the sorted run of relation r at processor p, and a_{i,p,r} is the i-th tuple in this sorted run.

Output: The creation of tasks and a heuristic assignment of those tasks to the processors which approximately minimizes the makespan.

Set the number of tasks N = 1.
Set the top and bottom of the first task to be TOP_{p,r,1} = 1 and BOT_{p,r,1} = CARD_{p,r} for each processor p = 1,...,P and each relation r = 1,2.
Determine the type (1 or 2) of the first task.
Do forever.
    Determine the optimal multiplicities MULT_n of each type 2 task n ∈ {1,...,N}. (Set MULT_n = 1 for each type 1 task n ∈ {1,...,N}.) Compute the total number of tasks to be N̂ = Σ_{n=1}^{N} MULT_n.
    Compute the task times {TIME_n^{MULT_n} | n = 1,...,N}.
    ...
    Else halt with solution from final LPT.
    End
End do.
End SKEW

[Figure 3.1. Using GM to Subdivide a Type 1 Task]
Notes on SKEW:

- In the extremely likely event that the first task created is of type 1, the task would correspond to performing the entire join phase on a single processor. If it is of type 2 instead, then the entire join is the join of a single element, so that we are forming a full cross-product of the rows of the two relations. The optimal multiplicity in this case will be determined to be P, and the algorithm will halt with an essentially perfect solution.
- In general, the optimal multiplicity for a type 2 task n will be that m with 1 ≤ m ≤ P and task time TIME_n^m with the smallest total time m·TIME_n^m subject to the constraints that TIME_n^m ≤ (m·TIME_n^m + REST)/P, where REST is the combined time of all other tasks, and that TIME_n^m ≥ MINTIME, where MINTIME, an input variable, is the largest size task which SKEW is not allowed to subdivide. The first constraint has the effect of requiring m to be greater than some minimum value, while the second constraint has the opposite effect. The first constraint ensures that each individual task must fit within one of the P processors. MINTIME is used to guard against splitting tasks too finely. (We do not model task initiation times explicitly, but by properly setting MINTIME, we have the same effect. In fact, the algorithm could be made to "throw away" the smallest tasks it creates, by coalescing them with one of their neighbors.) Since the question of whether a type 2 task fits or not depends on the multiplicities of the other type 2 tasks, we cycle through the type 2 tasks in order of size, determining optimal multiplicities, and then repeat the process until the multiplicities remain stable throughout a complete cycle.
- The solution can be unacceptable for several reasons. The most obvious is that the quality of the LPT solution is not within some input variable TOLERANCE. (If SOL_LPT denotes the makespan of the LPT solution, and SOL_PERFECT denotes the makespan of a "perfect" solution, the quality of the LPT solution will be acceptable if (SOL_LPT - SOL_PERFECT)/SOL_LPT < TOLERANCE.) However, the following reasons for failure are also valid: First, it may happen early on in the algorithm that N̂ < P, in which case LPT is not even called. Second, it may happen that the time TIME_n^1 of the largest type 1 task n may satisfy TIME_n^1 ≤ MINTIME. Finally, it may happen that the number N of tasks already created may satisfy N ≥ MAXT, where MAXT is an input variable designed to keep the algorithm from running too long. Generally, setting MAXT to be on the order of 10 times the number of processors proves quite satisfactory.
- We again use QUICKSORT to perform the sorting.

Figure 3.1 shows a type 1 task being subdivided into three new tasks. The entire 2P sets of sorted runs are shown, with the old type 1 task labeled with 1s, 2s and 3s. The new type 2 task corresponding to the median element of the old task is labeled with 2s. The other two new tasks are labeled with 1s and 3s, respectively. These latter two tasks may be of type 1, in which case they may be candidates for subdivision themselves at some further point in the algorithm.

3.4. Task time estimation:

In this section we derive the task time estimation formulas. To begin with, assume that we have a type 2 task T_n^v of multiplicity MULT. (The formulas to handle type 1 tasks will be based on the type 2 formula.) Let K1 and K2 denote the sizes (measured in blocks) of the two sets of tuples in relations R1 and R2, respectively, which correspond to the value v. For ease of exposition, let us assume that K1 is the larger of the two. (The formulas will merely need to be switched if the reverse is true.) Suppose that S is the memory buffer size (also in blocks) for each processor.

We can either split K1 into MULT equal parts, or we can split K2 into MULT equal parts. Let us label these as Methods 1 and 2, respectively. We will ultimately pick the method which gives the lowest task time. Whichever method we employ, we will let the larger component correspond to the outer loop, and the smaller component correspond to the inner loop. This is provably better than the reverse. The component corresponding to the outer loop will be allocated 1 block in the memory buffer, while the component corresponding to the inner loop will use the remaining S - 1 blocks. The blocks of the inner loop component cycle through the memory buffer once for each block of the outer loop component, in an alternatingly forwards and backwards manner. (This approach might accordingly be dubbed the ZIGZAG algorithm.) We thus utilize the memory in a way which minimizes the total number of blocks that need to be read. Let γ = K2/K1. By our convention, 0 ≤ γ ≤ 1.

Method 1: In this case, it is not apparent which of the two values, K1/MULT or K2 = γK1, is larger. So we let min = min(1/MULT, γ), and max = max(1/MULT, γ). A simple analysis then yields a time per processor of

    A[min·max·K1² + 2·max·K1 - max·K1·S + S - 1] + B·min·max·K1²

if min·K1 ≥ S - 1, and a time per processor of

    A[max·K1 + min·K1] + B·min·max·K1²

otherwise. Here, A is a coefficient which equals the per block pathlength overhead of reading in the data, extracting the tuples, merging the sorted runs, and performing the join comparison. B is a coefficient which equals the pathlength overhead of inserting the output tuples generated by joining one block of tuples (with identical join column values) from each of relations R1 and R2 into an output file and writing out the data. The second expression corresponds to
the case where the smaller component fits in the memory buffer, while the first expression corresponds to the case where it does not.

[Figure: Splitting a Type 2 Task Among Multiple Processors]
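The block-read count behind the first Method 1 expression can be checked with a small sketch. This is our naming, and it assumes the inner component does not fit in the S - 1 buffer blocks.

```python
def zigzag_blocks_read(outer, inner, S):
    # the outer-loop component gets 1 buffer block; the inner-loop
    # component cycles through the remaining S - 1 blocks, reversing
    # direction on each outer block so S - 1 inner blocks are reused free
    assert inner >= S - 1
    first_pass = inner                           # inner read once in full
    repasses = (outer - 1) * (inner - (S - 1))   # evicted blocks re-read
    return outer + first_pass + repasses
```

Expanding the return value gives outer·inner + 2·outer - outer·S + S - 1, which with outer = max·K1 and inner = min·K1 matches the bracketed term multiplying A in the first Method 1 expression.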
...overhead B in Section 3). We vary the number of processors from 1 to 128, and use combinations of θ values of 0 (pure Zipf for the highly skewed case), 0.5 (moderate skew) and 1 (uniform). Finally, the correlation between the specific skewed values in the two relations is modeled as follows: The D distinct values of relation R1 are arranged in descending order of the number of tuples that have this value in their join column. The correlation is modeled using a single parameter C that takes on integer values from 1 to D. Then, corresponding to the descending ordering of relation R1, the value in R2 with the largest number of tuples is placed in a position chosen randomly from 1 to C. The next most frequent value of R2 is placed in a randomly chosen position from 1 to C + 1, except that the position occupied by the previous step is not allowed, and so on. Thus C = 1 corresponds to perfect correlation, and C = D to the random case. We choose C = 500 for our comparisons, which corresponds to a moderate to high degree of correlation. Preliminary examination of potential join columns in some actual databases supports such a degree of correlation.

[Figure 4.1. Speedup of the Join Phase - High Skew Case (THETA1 = 0; THETA2 = 0)]
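The placement model can be sketched as follows. This is a hedged illustration; the function name and seeding are ours.

```python
import random

def correlate_positions(D, C, seed=0):
    # place the k-th most frequent value of R2 (k = 1,...,D) into a
    # random, not-yet-occupied position in 1..min(C + k - 1, D)
    rng = random.Random(seed)
    occupied, positions = set(), []
    for k in range(1, D + 1):
        limit = min(C + k - 1, D)
        p = rng.choice([q for q in range(1, limit + 1) if q not in occupied])
        occupied.add(p)
        positions.append(p)
    return positions
```

C = 1 forces the identity placement (perfect correlation), while C = D allows any permutation (the random case).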
We compare the speedups obtained using the proposed algorithm with two heuristics. In the first heuristic, the number of distinct values in the join column values is divided into P range partitions, each with (approximately) the same number of distinct values, and each partition is assigned to one of the P processors. Then tuples from each relation are shipped to the assigned processor and are merged and joined in the final phase. Intuitively, merely dividing the distinct values without regard to the number of tuples with the same value can be expected to lead to poor speedups when the data is highly skewed. For the speedups using this heuristic reported below, the values are idealistic because perfect knowledge of the distinct values in the join column is used in the assignment. In an implementation the heuristic could be approximated by equally dividing the range between the minimum and maximum values of each relation, as determined during the sort phase. Another simple scheme that approximates this heuristic is to use a (uniform) hash partitioning that assigns a distinct value to a particular processor depending on the result of a hash function of the value [DEWI87]. However, the results show that even with perfect information, this (naive) heuristic has poor performance in the presence of moderate to high data skew.

The second heuristic used for comparison purposes partitions the two relations into P ranges such that the sum of the number of tuples in each range from the union of both relations to be joined is (approximately) 1/P of the total number of tuples in both relations, and each range is assigned to one of the P processors. The range partitioning can be done using the parallelized version of the Galil-Megiddo algorithm outlined in Section 3. This method is similar to that in [AKL87] in the context of merging two lists, where an algorithm is proposed that breaks each of the two lists into two ranges such that the sum of the number of elements in the first range of the two lists is half the sum of the total number of elements in both lists. A similar algorithm is proposed in [IYER88].

A word about the methodology employed in computing the speedup is in order: We are using the "actual" makespan of the SKEW algorithm rather than the estimated makespan of Section 3. This means that we plug in the actual distributions of values into the formulas of Section 3, but we employ the methods (1 or 2) determined by our estimates, whether they be right or wrong. The actual task times for type 1 elements will thus be higher, in general, than the times obtained if full knowledge of the distribution of values were known beforehand.

We first consider the case where both the relations have highly skewed distributions of values in the join column. This case corresponds to θ1 = θ2 = 0, i.e., pure Zipf distributions for both relations. Figure 4.1 shows the number of processors versus the speedup of the join phase for this case for the proposed scheme and the two heuristics outlined above. In this context, the speedup is the ratio of the CPU time to complete the join phase on one processor to the (actual makespan of the) time on P processors. The figure shows a close to linear speedup for the proposed algorithm, and small speedup for both the heuristics. The same data is displayed as a normalized speedup in Figure 4.2. The normalized speedup for the join phase is defined as the ratio of the speedup to the number of processors. Therefore, a normalized speedup of unity represents the ideal case of perfect speedup. The figure shows that for the proposed scheme, the normalized speedup is for the most part greater than 0.9, and usually close to unity. Virtually the entire reason for the departure of the normalized speedup from unity is the difference between the estimated CPU run time for a task and the actual run time. (LPT never gave a bad solution to the minimum makespan problem.) This
[Figure 4.2. Normalized Speedup of the Join Phase - High Skew Case]
[Figure 4.3. Number of Largest Type 2 Pairs Assigned to Tasks for the High Skew Case]
discrepancy, in turn, is caused by our simplistic assumption of a uniform distribution within a range in estimating the time for the join operations in a type 1 task. As described in Section 3, a stopping condition for the algorithm is the creation of a fixed number of tasks per processor. Therefore, for a small number of processors, the number of iterations in the algorithm is small. There are some type 1 tasks that have a sizeable skew, giving rise to a discrepancy between the estimated and actual run times. As the number of processors increases, so do the number of iterations. Therefore, the estimates of run times get better, leading to better speedups. We expect that better methods of estimating the times of a type 1 task will further improve the speedups using the proposed method. Improved estimation techniques have been devised and will be reported on later. This argument is supported by the bar chart in Figure 4.3. To understand this chart, suppose that all the potential type 2 pairs are ordered by task times. Then Figure 4.3 shows, for each number of processors, the number of the largest potential type 2 pairs that were identified by the algorithm and assigned to separate tasks before a miss occurs -- in other words, the next largest potential type 2 pair was not identified, and occurred as part of a type 1 task instead. The chart shows that the number of large type 2 pairs created by the algorithm increases quickly as the number of processors increases. Therefore, the estimates for the run time of the remaining type 1 tasks improves with the number of processors. Notice from Figure 4.2 the sudden improvement in the normalized speedup in going from 8 to 16 processors. The reason for this behaviour can be seen from Figure 4.3, where the number of the largest type 2 pairs assigned separate tasks increases from 6 to 19 with this change. This leads to a significantly better estimate for the type 1 tasks, and therefore to the large improvement in the normalized speedup. The multiplicities of the five largest type 2 pairs as a function of the number of processors is shown in the bar chart of Figure 4.4. Note that while the multiplicities increase with the number of processors, they are still much smaller than the total number of processors available.

Returning to Figure 4.2, we note that the heuristics do poorly for this case because of the high data skew. For the first heuristic, some partitions have a disproportionately large number of tuples, leading to long run times for the processor that is assigned the partition. This effect becomes worse as the number of processors increases because the run time becomes dominated by a few partitions. For the second heuristic, though the number of tuples in each partition is the same, the join output begins to dominate for processors that are assigned the large skew values. Eventually, the largest skew elements determine the makespan of this heuristic, and the speedup converges to that of the first heuristic.

Figure 4.5 shows the normalized speedup for the three algorithms for the case of a high skew on relation R1 (θ1 = 0) and a medium skew on relation R2 (θ2 = 0.5). Again, the normalized speedup for the proposed scheme is close to 0.9 for up to 128 processors, for the same reasons as given above. The first heuristic shows no improvement over the previous case. The second heuristic shows some improvement...

[Figure 4.4. Multiplicity of 5 Largest Type 2 Pairs for the High Skew Case]
[Figure 4.5. Normalized Speedup of the Join Phase - Relation 1 (2) high (medium) Skew]
[Figure 4.6. Normalized Speedup of the Join Phase - Relation 1 (2) high (no) Skew]
[Figure 4.7. Normalized Speedup of the Join Phase - Relation 1 (2) medium (medium) Skew]
[Figure 4.8. Normalized Speedup of the Join Phase - Relation 1 (2) medium (no) Skew]
relation R2. For this single (low) skew case, the speedup from the proposed scheme is almost ideal, and the two heuristics are much improved. Even so, the speedup of the proposed scheme for 128 processors is about twice that of the first heuristic, and about a third better than the second heuristic.

Finally, we should point out that even our "actual" task times are really stochastic rather than deterministic in nature. Recent studies in [LAKS89a,b] have shown that the speedup achievable through horizontal growth can be quite sensitive to such variations in task times. One simple way to deal with this issue within the general context of our proposed algorithm is to force the processors to execute their tasks in the same order (largest first) in which they were assigned by LPT. During the course of the join phase the processors could report their progress. If the quality of the SKEW solution degrades past a certain predetermined threshold a new LPT algorithm could be initiated to handle the tasks remaining. Obviously, one would have to modify the timing of the transfer phase somewhat to allow for such a scheme. Slightly more elaborate approaches could also be devised. For example, the type 2 tasks are more deterministic than the type 1 tasks, so some of the smaller of these type 2 tasks might be executed last.

5. Summary:

Conventional parallel join algorithms perform poorly in the presence of data skew. In this paper, we propose a parallel sort merge join algorithm which can effectively handle the data skew problem. The proposed algorithm introduces a scheduling phase in addition to the usual sort, transfer and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. Two basic optimization techniques are employed repeatedly. One solves the selection problem, while the other heuristically solves the minimum makespan problem. Our approach naturally identifies the largest skew elements and assigns each of them to an optimal number of processors. The algorithm is demonstrated to achieve very good load balancing for the join phase in a CPU-bound environment and to be very robust relative to the degree of data skew and the number of processors. A Zipf-like distribution is used to model the data skew. Although we assume that the system is CPU-bound and thus that a CPU pathlength estimate be used in the objective function, we expect other environments can be similarly handled merely by employing a different objective function.

A hash join version of our parallel join algorithm has also been devised [WOLF90]. We are in the process of testing our algorithm against real databases. We have also devised improved type 1 task time estimation techniques, which substantially enhance the speedups and reduce the overheads due to SKEW, and will be reported on later.

References:

AHO74 Aho, A.V., Hopcroft, J.E. and Ullman, J.D. (1974) The Design and Analysis of Computer Algorithms, Addison-Wesley.
AKL87 Akl, S.G. and Santoro, N. (1987) "Optimal Parallel Merging and Sorting Without Memory Conflicts," IEEE Trans. on Computers, Vol. C-36, No. 11, 1367-1369.
BIC85 Bic, L. and Hartmann, R.L. (1985) "Hither Hundreds of Processors in a Database Machine," Proceedings of the 1985 International Workshop on Database Machines, Springer Verlag.
BLAS77 Blasgen, M. and Eswaran, K. (1977) "Storage and Access in Relational Databases," IBM Systems Journal, Vol. 4, p. 363.
BLUM72 Blum, M., Floyd, R.W., Pratt, V.R., Rivest, R.L. and Tarjan, R.E. (1972) "Time Bounds for Selection," Journal of Computer and System Sciences, Vol. 7, 448-461.
BRAT84 Bratsbergsengen, K. (1984) "Hashing Methods and Relational Algebra Operations," Proceedings of the 10th International Conference on Very Large Databases.
CHRI83 Christodoulakis, S. (1983) "Estimating Record Selectivities," Information Systems, Vol. 8, No. 2, 105-115.
COFF73 Coffman, E. and Denning, P.J. (1973) Operating Systems Theory, Prentice Hall.
COFF78 Coffman, E., Garey, M. and Johnson, D.S. (1978) "An Application of Bin Packing to Multiprocessor Scheduling," SIAM Journal of Computing, Vol. 7, 1-17.
CORN86 Cornell, D.W., Dias, D.M. and Yu, P.S. (1986) "On Multi-System Coupling through Function Request Shipping," IEEE Trans. Software Eng., Vol. SE-12, No. 10, 1006-1017.
DEMU85 Demurjian, S.A., Hsiao, D.K., Kerr, D.S., Menon, J., Strawser, P.R., Tekampe, R.C., Trimble, J. and Watson, R.J. (1985) "Performance Evaluation of a Database System in Multiple Backend Configurations," Proceedings of the 1985 International Workshop on Database Machines, Springer Verlag.
DEWI85 DeWitt, D.J. and Gerber, R.H. (1985) "Multiprocessor Hash-Based Join Algorithms," Proceedings of the 11th International Conference on Very Large Databases.
DEWI86 DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B. and Muralikrishna, M. (1986) "GAMMA - A High Performance Dataflow Database Machine," Proceedings of the 12th International Conference on Very Large Databases.
DEWI87 DeWitt, D.J., Smith, M. and Boral, H. (1987) "A Single-User Performance Evaluation of the Teradata Database Machine," MCC Technical Report 88-081-87.
FRED82 Frederickson, G. and Johnson, D.B. (1982) "The Complexity of Selection and Ranking in X + Y and Matrices with Sorted Columns," Journal of Computer and System Sciences, Vol. 24, 197-209.
FRED84 Frederickson, G. and Johnson, D.B. (1984) "Generalized Selection and Ranking: Sorted Matrices," SIAM Journal of Computing, Vol. 13, 14-30.
GALI79 Galil, Z. and Megiddo, N. (1979) "A Fast Selection Algorithm and the Problem of Optimum Distribution of Effort," Journal of the ACM, Vol. 26, 58-64.
GRAH69 Graham, R. (1969) "Bounds on Multiprocessing Timing Anomalies," SIAM Journal of Appl. Math., Vol. 17, No. 2, 416-429.
HSIA83 Hsiao, D.K. (1983) Advanced Database Machine Archi-
estimation techniques, and are examining methods to speed up the tecture, Prentice-Hall.
lBAR88 Ibamki, T. and Katoh, N. (1988) Resource Allocation
scheduling phase. Relatively small changes to our current approach Problems. MIT Press.
114
IYER88 Iyer B.R. and Dias D.M. (1988) "System Issues in Parallel Sorting for Database Systems," IBM Research Report RJ 6585.
IYER89 Iyer B.R., Ricard G.R. and Varman P.J. (1989) "Percentile Finding Algorithm for Multiple Sorted Runs," Proceedings of the 15th International Conference on Very Large Databases.
KITS83 Kitsuregawa M., Tanaka H. and Moto-oka T. (1983) "Application of Hash to Data Base Machine and its Architecture," New Generation Computing, Vol. 1, No. 1.
KNUT73 Knuth D.E. (1973) The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley.
LAKS88 Lakshmi S. and Yu P.S. (1988) "Effect of Skew on Join Performance in Parallel Architectures," Proceedings Intl. Symposium on Databases in Parallel and Distributed Systems.
LAKS89a Lakshmi S. and Yu P.S. (1989) "Limiting Factors of Join Performance on Parallel Processors," Proceedings of the 5th Intl. Conf. on Data Engineering.
LAKS89b Lakshmi S. and Yu P.S. (1989) "Analysis of Parallel Processing Architectures for Database System," Proceedings of the 1989 Intl. Conf. on Parallel Processing.
LYNC88 Lynch C.A. (1988) "Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distributions of Column Values," Proceedings of the 14th International Conference on Very Large Databases.
MONT83 Montgomery A.Y., D'Souza D.J. and Lee S.B. (1983) "The Cost of Relational Algebraic Operations on Skewed Data: Estimates and Experiments," Information Processing 83, IFIP.
NECH84 Neches P.M. and Shemer J.E. (1984) "The Genesis of a Database Computer," IEEE Computer, Vol. 17, No. 11, 42 - 56.
OZKA86 Ozkarahan E. (1986) Database Machines and Database Management, Prentice Hall.
QADA85 Qadah G.Z. (1985) "The Equi-Join Operation on a Multiprocessor Database Machine: Algorithms and the Evaluation of their Performance," Proceedings of the 1985 International Workshop on Database Machines, Springer Verlag, 35 - 67.
SALZ83 Salza S., Terranova M. and Velardi P. (1983) "Performance Modeling of the DBMAC Architecture," Proceedings of the 1983 International Workshop on Database Machines, Springer-Verlag, 74 - 90.
SCHN89 Schneider D.A. and DeWitt D.J. (1989) "A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment," ACM Sigmod Conference, Portland, Oregon, 110 - 121.
STON86 Stonebraker M. (1986) "The Case for Shared Nothing," IEEE Database Engineering, Vol. 9, No. 1.
TANT88 Tantawi A.N., Towsley D. and Wolf J. (1988) "Optimal Allocation of Multiple Class Resources in Computer Systems," ACM Sigmetrics Conference, Santa Fe, New Mexico, 253 - 260.
VALD84 Valduriez P. and Gardarin G. (1984) "Join and Semi-Join Algorithms for a Multiprocessor Database Machine," ACM Transactions on Database Systems, Vol. 9, No. 1, 133 - 161.
WOLF90 Wolf J.L., Dias D.M. and Yu P.S. (1990) "An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew," IBM Research Report RC 15510.
YU87 Yu P.S., Dias D.M., Robinson J.T., Iyer B.R. and Cornell D.W. (1987) "On Coupling Multi-Systems Through Data Sharing," Proceedings of the IEEE, Vol. 75, No. 5, 573 - 587.
ZIPF49 Zipf G.K. (1949) Human Behavior and the Principle of Least Effort, Addison-Wesley.
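The summary notes that the scheduling phase repeatedly applies a heuristic for the minimum makespan problem, in the tradition of Graham's bounds [GRAH69]. As a rough illustration only, and not the paper's actual scheduling algorithm, the sketch below implements the classic longest-processing-time (LPT) rule; the task times, processor count, and function name are hypothetical examples.

```python
# Illustrative sketch: Graham's LPT (longest-processing-time) heuristic for
# the minimum makespan problem. Hypothetical example, not the paper's
# scheduling-phase algorithm.
import heapq

def lpt_schedule(task_times, num_processors):
    """Assign each task to the currently least-loaded processor, taking
    tasks in decreasing order of processing time (the LPT order).
    Returns (makespan, assignment), where assignment[p] lists the task
    indices placed on processor p."""
    # Min-heap of (current_load, processor_id) pairs.
    heap = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_processors)]
    for i in sorted(range(len(task_times)), key=lambda i: -task_times[i]):
        load, p = heapq.heappop(heap)      # least-loaded processor so far
        assignment[p].append(i)
        heapq.heappush(heap, (load + task_times[i], p))
    makespan = max(load for load, _ in heap)
    return makespan, assignment

if __name__ == "__main__":
    # A skewed workload: one large "skew element" plus several small tasks.
    times = [10.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
    makespan, assignment = lpt_schedule(times, 3)
    print(makespan)  # -> 10.0
```

On this skewed example LPT alone yields a makespan of 10.0 against an ideal of roughly 7.3, because an unsplittable large task dominates one processor. This is precisely the imbalance that motivates the paper's step of identifying the largest skew elements and assigning each of them to multiple processors.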