1999 (125) - Heterogeneous Task Scheduling Algorithms
Haluk Topcuoglu
Department of Electrical Engineering and Computer Science
Syracuse University, Syracuse, NY 13244-4100
haluk@top.cis.syr.edu
Salim Hariri
Department of Electrical and Computer Engineering
The University of Arizona, Tucson, Arizona 85721-0104
Min-You Wu
Department of Electrical and Computer Engineering
University of Central Florida
In some previous algorithms the level attribute is used to set the priorities of the tasks. The level of a task is computed by the maximum number of edges along any path to the task from the start node. The start node has a level of zero.

3. Proposed Algorithms

In this section, we present our scheduling algorithms: the Heterogeneous Earliest Finish Time (HEFT) Algorithm and the Critical-Path-on-a-Processor (CPOP) Algorithm.

3.1. The HEFT Algorithm

The Heterogeneous-Earliest-Finish-Time (HEFT) Algorithm, as shown in Figure 1, is a DAG scheduling algorithm that supports a bounded number of heterogeneous processing elements (PEs). To set the priority of a task ni, the HEFT algorithm uses the upward rank value of the task, ranku (Equation 3), which is the length of the longest path from ni to the exit node. The ranku calculation is based on mean computation and communication costs. The task list is generated by sorting the nodes in decreasing order of the ranku values. In our implementation ties are broken randomly; i.e., if two nodes to be scheduled have equal ranku values, one of them is selected randomly.

The HEFT algorithm uses the earliest finish time value, EFT, to select the processor for each task. In noninsertion-based scheduling algorithms, the earliest available time of a processor pj, the T_Available[j] term in Equation 1, is the execution completion time of the last assigned node on pj. The HEFT Algorithm, in contrast, uses an insertion-based policy that also considers idle time slots between two already-scheduled tasks on a processor.

3.2. The CPOP Algorithm

The Critical-Path-on-a-Processor (CPOP) Algorithm, shown in Figure 2, is another heuristic for scheduling tasks on a bounded number of heterogeneous processors. The ranku and rankd attributes of the nodes are computed using mean computation and communication costs. The critical path nodes (CPNi) are determined at Steps 5-6. The critical-path processor (CPP) is the one that minimizes the length of the critical path (Step 7). The CPOP Algorithm uses rankd(ni) + ranku(ni) to assign the node priorities. The processor selection phase has two options: if the current node is on the critical path, it is assigned to the critical-path processor (CPP); otherwise, it is assigned to the processor that minimizes the execution completion time. The latter option is insertion-based (as in the HEFT algorithm). At each iteration we maintain a priority queue containing all free nodes and select the node that maximizes rankd(ni) + ranku(ni). A binary heap was used to implement the priority queue, which has a time complexity of O(log v) for insertion and deletion of a node and O(1) for retrieving the node with the highest priority (the root node of the heap). The time complexity of the algorithm is O(v^2 p) for v nodes and p processors.

4. Related Work

Only a few of the proposed task scheduling heuristics support variable computation and communication costs for heterogeneous domains: the Dynamic Level Scheduling (DLS) Algorithm [4], the Levelized-Min Time (LMT) Algorithm [9], and the Mapping Heuristic (MH) Algorithm [5]. Although there are genetic-algorithm-based research efforts [6, 10, 14], most of them are slow and usually do not perform as well as the list-scheduling algorithms.
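The upward-rank computation and EFT-based processor selection of Section 3.1 can be sketched in Python. This is a hypothetical, noninsertion-based sketch, not the paper's implementation (the actual HEFT Algorithm additionally considers inserting a task into an idle slot); the data-structure names are illustrative.

```python
# W[i][j]: computation cost of task i on processor j
# C[(i, k)]: average communication cost of edge (i, k)
# succ / pred: successor and predecessor adjacency dicts of the DAG

def upward_ranks(succ, W, C):
    # ranku(ni) = mean(wi) + max over successors nk of (c_ik + ranku(nk))
    ranks = {}
    def rank(i):
        if i not in ranks:
            mean_w = sum(W[i]) / len(W[i])
            ranks[i] = mean_w + max(
                (C[(i, k)] + rank(k) for k in succ[i]), default=0.0)
        return ranks[i]
    for i in succ:
        rank(i)
    return ranks

def heft_schedule(succ, pred, W, C):
    ranks = upward_ranks(succ, W, C)
    order = sorted(succ, key=ranks.get, reverse=True)  # decreasing ranku
    p = len(next(iter(W.values())))
    avail = [0.0] * p                  # T_Available[j] (noninsertion policy)
    finish, proc = {}, {}
    for i in order:
        best = None
        for j in range(p):
            # data from a predecessor on the same processor arrives at no cost
            est = max([avail[j]] +
                      [finish[k] + (0.0 if proc[k] == j else C[(k, i)])
                       for k in pred[i]])
            eft = est + W[i][j]        # earliest finish time on processor j
            if best is None or eft < best[0]:
                best = (eft, j)
        eft, j = best
        avail[j] = eft
        finish[i], proc[i] = eft, j
    return finish, proc
```

On a toy four-task diamond DAG with two processors, this sketch produces a makespan of 8; the insertion-based policy of the actual algorithm can only match or improve such a schedule.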
Figure 1. The HEFT Algorithm:
1. Compute ranku for all nodes by traversing the graph upward, starting from the exit node.
...
7. Assign the task ni to the processor pj that minimizes the EFT value of ni.
8. end

Figure 2. The CPOP Algorithm:
1. Compute ranku for all nodes by traversing the graph upward, starting from the exit node.
2. Compute rankd for all nodes by traversing the graph downward, starting from the start node.
...
15. else
16. Assign the task ni to the processor pj that minimizes the EFT value of ni.
17. Update the priority queue with the successor(s) of ni if they become ready nodes.
18. end
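The binary-heap priority queue used by CPOP to hold the free nodes can be sketched with Python's heapq; the class below and its priority table are illustrative, not the paper's code.

```python
import heapq

# Sketch of CPOP's ready-node queue: free nodes are kept in a binary heap
# keyed by -(rankd(ni) + ranku(ni)), so the highest-priority node is at
# the root of the heap.
class ReadyQueue:
    def __init__(self, priority):
        self._priority = priority      # node -> rankd(ni) + ranku(ni)
        self._heap = []

    def push(self, node):              # O(log v) insertion
        heapq.heappush(self._heap, (-self._priority[node], node))

    def peek(self):                    # O(1): root of the heap
        return self._heap[0][1]

    def pop(self):                     # O(log v) deletion
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)
```

Pushing nodes with priorities 5, 9, and 7 and popping repeatedly returns them in decreasing priority order, matching the selection rule described above.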
Mapping Heuristic (MH) Algorithm. The MH Algorithm uses static upward ranks (ranku^s) to assign priorities to the nodes. A ready-node list is kept sorted in decreasing order of priorities. (For tie-breaking, the node with the largest number of immediate successors is selected.) With a noninsertion-based method, the processor that provides the minimum earliest finish time of a task is selected to run the task. After a task is scheduled, the immediate successors of the task are inserted into the list. These steps are repeated until all nodes are scheduled. The time complexity of this algorithm is O(v^2 p) for v nodes and p processors.

Dynamic-Level Scheduling (DLS) Algorithm. The DLS Algorithm assigns node priorities using an attribute called the Dynamic Level (DL), defined as DL(ni, pj) = ranku^s(ni) - EST(ni, pj). (In contrast to the mean values, median values are used to compute the static upward ranks; and for the EST computation, the noninsertion method is used.) At each scheduling step the algorithm selects the (ready node, available processor) pair that maximizes the dynamic level value. For heterogeneous environments a Δ(ni, pj) term is added to the dynamic level computation; its value for a task on a processor is the difference between the task's median execution time over all processors and its execution time on the current processor. The DLS Algorithm has an O(v^3 p f(p)) time complexity, where v is the number of tasks and p is the number of processors. The complexity of the function used for data routing to calculate the earliest start time is O(f(p)) [4].

Levelized-Min Time (LMT) Algorithm. This is a two-phase algorithm. The first phase orders the tasks based on their precedence constraints, i.e., level by level; this phase groups the tasks that can be executed in parallel. The second phase is a greedy method that assigns each task (level by level) to the "fastest" available processor as much as possible. A task in a lower level has higher priority for scheduling than a task in a higher level; within the same level, the task with the highest average computation cost has the highest priority. If the number of tasks in a level is greater than the number of available processors, the fine-grain tasks are merged into a coarse-grain task until the number of tasks equals the number of processors. Then the tasks are sorted in reverse order (largest task first) based on average computation time. Beginning from the largest task, each task is assigned to the processor a) that minimizes the sum of the computation cost of the task and the communication costs with tasks in the previous layers, and b) that does not have any scheduled task at the same level. For a fully connected graph, the time complexity is O(p^2 v^2) for v tasks and p processors.

5. Performance and Comparison

In this section we present the performance comparison of our algorithms with the related work, using randomly generated task graphs and regular task graphs representing applications. The following metrics were used to compare the performance of our proposed algorithms with the previous approaches.

Schedule Length Ratio (SLR). The SLR of an algorithm is defined as

    SLR = makespan / ( Σ_{ni ∈ CP_MIN} min_{pj ∈ P} { wi,j } )

where makespan is the schedule length of the algorithm's output schedule. The denominator is the summation of the computation costs of the nodes on the CP_MIN. (For an unscheduled DAG, if the computation cost of each node ni is set to min_{pj ∈ P} { wi,j }, then the resulting critical path, CP_MIN, will be based on minimum computation values.) The SLR value of any algorithm for a DAG cannot be less than one, since the denominator is a lower bound on the completion time of the graph. The algorithm that gives the lowest SLR value for a DAG is the best algorithm with respect to performance (i.e., the one that gives the minimum overall execution time).

Speedup. The speedup value of an algorithm is computed by dividing the overall computation time of the sequential execution by the makespan of the algorithm. (The sequential execution time is computed by assigning all tasks to the single processor that minimizes the total computation time of the DAG.) Formally, it is defined as

    Speedup = min_{pj ∈ P} ( Σ_{ni ∈ DAG} wi,j ) / makespan.

Running Time. This is the average cost of each scheduling algorithm for obtaining the schedules of the given graphs on a Sun SPARC 10 workstation. The trade-offs between the performance (SLR value) and the cost (running time) of the scheduling algorithms are given in this section.

5.1. Generating Random Task Graphs

We developed an algorithm to generate the random directed acyclic graphs used to evaluate the proposed scheduling algorithms. The input parameters required to generate a weighted random DAG are the following:

Number of nodes (tasks) in the graph, v.

Shape parameter of the graph, α. We assume that the height of a DAG is randomly generated from a uniform distribution with mean equal to √v / α. The width for each level in a DAG is randomly selected from a uniform distribution with mean equal to α × √v. If α = 1.0, the DAG will be balanced. A dense DAG (a shorter graph with high parallelism) can be generated by selecting α >> 1.0; similarly, α << 1.0 generates a longer DAG with a small degree of parallelism.

Out degree of a node, out_degree. One method is to use a fixed out_degree value for all nodes in a DAG as much as possible. Another alternative is to use a mean out_degree; then each node's out degree is randomly generated from a uniform distribution with mean equal to out_degree.

Communication to computation ratio, CCR. CCR is the ratio of the average communication cost to the average computation cost. If a DAG's CCR value is low, it can be considered a computation-intensive application; if it is high, it can be considered a communication-intensive application.

Average computation cost in the graph, avg_comp. Computation costs are generated randomly from a uniform distribution with mean equal to avg_comp. Similarly, the average communication cost is derived as avg_comm = CCR × avg_comp.

Range percentage of computation costs on processors, β. A high β value causes a significant difference in a node's computation cost among the processors; a very low β value means that the expected execution times of a node on the processors are almost equal. The computation cost of each node ni on each processor pj is randomly selected from the range wi × (1 - β/2) to wi × (1 + β/2), where β is the range percentage.

All of the experiments resulted in the same performance ranking of the algorithms.
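The graph-generation parameters above can be sketched as a small Python generator. This is a minimal illustration of the described parameters (v, α, out_degree, CCR, avg_comp, β); the exact distributions and level-assignment procedure used in the paper may differ, and all names here are illustrative.

```python
import math
import random

def random_dag(v, alpha, out_degree, ccr, avg_comp, beta, seed=0):
    rng = random.Random(seed)
    # Level widths: uniform with mean alpha * sqrt(v); keep adding levels
    # until all v tasks are placed.
    mean_width = alpha * math.sqrt(v)
    levels, placed = [], 0
    while placed < v:
        width = min(v - placed,
                    max(1, round(rng.uniform(1, 2 * mean_width - 1))))
        levels.append(list(range(placed, placed + width)))
        placed += width
    # Edges: each task points to up to out_degree tasks in the next level;
    # edge weights are uniform with mean ccr * avg_comp.
    edges = {}
    for lvl, nxt in zip(levels, levels[1:]):
        for i in lvl:
            for k in rng.sample(nxt, min(out_degree, len(nxt))):
                edges[(i, k)] = rng.uniform(0, 2 * ccr * avg_comp)
    # Computation costs: wi is uniform with mean avg_comp, and the cost of
    # task i on each processor is drawn from [wi*(1-beta/2), wi*(1+beta/2)].
    def costs(num_procs):
        W = {}
        for i in range(v):
            wi = rng.uniform(0, 2 * avg_comp)
            W[i] = [rng.uniform(wi * (1 - beta / 2), wi * (1 + beta / 2))
                    for _ in range(num_procs)]
        return W
    return levels, edges, costs
```

Because each level occupies a consecutive index range, every generated edge points forward, so the graph is acyclic by construction.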
Each ranking starts with the best algorithm and ends with the worst one.

Figure 4. Average Running Time of the Algorithms with Respect to Graph Size

Performance Study with Respect to CCR. Figure 5 gives the average SLR of the algorithms when the CCR value is in [0.1, 1.0] and in [1.0, 5.0], with the number of processors equal to 10. When CCR ≥ 0.25, the HEFT Algorithm outperforms the other algorithms; if CCR ≤ 0.2, the DLS algorithm gives a better performance than the HEFT Algorithm. The performance ranking of the algorithms (when CCR ≤ 1.0) is {HEFT, DLS, MH, CPOP, LMT}. When CCR > 1.0, the performance ranking changes to {HEFT, CPOP, DLS, MH, LMT}.

Figure 5. Average SLR of the Algorithms with Respect to CCR: (a) 0.1 ≤ CCR ≤ 1.0, (b) 1.0 ≤ CCR ≤ 10.0
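The SLR and speedup metrics defined in Section 5 can be sketched as two small helpers; the function and variable names below are illustrative.

```python
# W[i][j]: cost of task i on processor j; cp_min: the tasks on CP_MIN.

def slr(makespan, cp_min, W):
    # Denominator: sum over CP_MIN nodes of the minimum cost on any processor.
    return makespan / sum(min(W[i]) for i in cp_min)

def speedup(makespan, W):
    # Sequential time: all tasks on the single processor with the least
    # total computation time.
    num_procs = len(next(iter(W.values())))
    seq = min(sum(W[i][j] for i in W) for j in range(num_procs))
    return seq / makespan
```

For example, with three tasks costing {2, 4}, {3, 3}, and {5, 1} on two processors, the best sequential processor takes 8 time units, so a schedule with makespan 6 has speedup 8/6 and, if tasks 0 and 2 form CP_MIN, SLR 6/(2+1) = 2.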
5.2. Applications

In this section we present a performance comparison of the scheduling algorithms based on Gaussian elimination and Fast Fourier Transformation (FFT). For the Gaussian elimination graphs, the number of nodes, the computation costs of the nodes, and the communication costs of the edges can be characterized by the input matrix size (each computation cost is the cost of an addition and a multiplication, or of a division). The FFT graphs can similarly be characterized by the input size.

[Figure: the sequential Gaussian elimination pseudocode (for k = 1 to n-1 do; for j = k+1 to n do; ...) and its task graph with nodes T1,1, T1,2, ..., T3,3, ...]
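The Gaussian elimination task graph suggested by the figure's doubly nested loop can be sketched as follows. The edge set here is inferred from the pseudocode, with T(k,k) as the pivot/division task of step k and T(k,j) as its update tasks; the paper's exact dependence structure may differ, so this is an illustration only.

```python
def gaussian_elimination_dag(n):
    # Build the edge set of a Gaussian-elimination task graph for an
    # n x n matrix.  Tasks are labeled T(k, j) with 1 <= k < n, k <= j <= n.
    edges = set()
    for k in range(1, n):
        # The pivot task T(k,k) enables every update task T(k,j) of step k.
        for j in range(k + 1, n + 1):
            edges.add(((k, k), (k, j)))
        if k < n - 1:
            # T(k,k+1) feeds the next pivot; T(k,j) feeds the next update.
            edges.add(((k, k + 1), (k + 1, k + 1)))
            for j in range(k + 2, n + 1):
                edges.add(((k, j), (k + 1, j)))
    return edges
```

For n = 5 this produces (n^2 + n - 2)/2 = 14 tasks, and every edge goes strictly forward in (k, j) order, so the graph is acyclic.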
Figure 8. Average Running Time for Each Algorithm to Schedule Gaussian Elimination Graphs

[Figure: (a) the recursive FFT algorithm; (b) the generated FFT task graph, with recursive call nodes (RCN) and butterfly operation nodes (BON).]

FFT(A, ω)
1. n = length(A)
2. if (n = 1) return (A)
3. Y(0) = FFT( (A[0], A[2], ..., A[n-2]), ω^2 )
4. Y(1) = FFT( (A[1], A[3], ..., A[n-1]), ω^2 )
5. for i = 0 to n/2 - 1 do
6. { Y[i] = Y(0)[i] + ω^i * Y(1)[i]
   Y[i+n/2] = Y(0)[i] - ω^i * Y(1)[i] }
7.
8. return (Y)
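The recursive FFT pseudocode above can be rendered as a short runnable Python sketch; the twiddle factor ω = e^{-2πi/n} is the standard choice and is assumed here.

```python
import cmath

def fft(a):
    # Recursive radix-2 FFT mirroring the figure's pseudocode; len(a) must
    # be a power of two.
    n = len(a)
    if n == 1:
        return list(a)
    y0 = fft(a[0::2])                      # Y(0): even-indexed elements
    y1 = fft(a[1::2])                      # Y(1): odd-indexed elements
    y = [0j] * n
    for i in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * i / n)   # twiddle factor ω^i
        y[i] = y0[i] + w * y1[i]                # butterfly operation
        y[i + n // 2] = y0[i] - w * y1[i]
    return y
```

For instance, fft([1, 1, 1, 1]) yields [4, 0, 0, 0] (up to rounding), and fft([1, 0, 0, 0]) yields the all-ones vector, matching the discrete Fourier transform of those inputs.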
Figure 10. (a) Average SLRs of Scheduling Algorithms at Various Graph Sizes for FFT Graph (b) Efficiency Comparison of the Algorithms

6. Conclusion and Future Work

In this paper we have proposed two task scheduling algorithms (the HEFT Algorithm and the CPOP Algorithm) for heterogeneous processors. For the generated random task graphs and the task graphs of selected applications, the HEFT algorithm outperforms the other algorithms in all performance metrics, i.e., average schedule length ratio (SLR), speedup, and time complexity. Similarly, the CPOP Algorithm is better than, or at least comparable to, the existing algorithms. Both algorithms perform more stably than the others in terms of scheduling quality and running time.

We are extending our algorithms to improve their performance at specific CCR ranges and computing-power weight ranges. Additionally, we plan to add low-cost, local-search techniques to improve the scheduling quality of our algorithms. Future work will include different techniques for ordering the ready tasks and extensions for the processor selection phase.
Figure 11. Running Times of the Scheduling Algorithms for FFT Graphs

References

[4] G. C. Sih and E. A. Lee, "A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 4, pp. 175-186, Feb. 1993.

[5] H. El-Rewini and T. G. Lewis, "Scheduling Parallel Program Tasks onto Arbitrary Target Machines," Journal of Parallel and Distributed Computing, vol. 9, pp. 138-153, 1990.

[6] H. Singh and A. Youssef, "Mapping and Scheduling Heterogeneous Task Graphs Using Genetic Algorithms," Proc. of Heterogeneous Computing Workshop, 1996.

[7] I. Ahmad and Y. Kwok, "A New Approach to Scheduling Parallel Programs Using Task Duplication," Proc. of Int'l Conference on Parallel Processing, vol. II, pp. 47-51, 1994.
[8] M. A. Palis, J. Liou, and D. S. L. Wei, "Task Clustering and Scheduling for Distributed Memory Parallel Architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 7, pp. 46-55, Jan. 1996.

[9] M. Iverson, F. Ozguner, and G. Follen, "Parallelizing Existing Applications in a Distributed Heterogeneous Environment," Proc. of Heterogeneous Computing Workshop, pp. 93-100, 1995.

[10] P. Shroff, D. W. Watson, N. S. Flann, and R. Freund, "Genetic Simulated Annealing for Scheduling Data-Dependent Tasks in Heterogeneous Environments," Proc. of Heterogeneous Computing Workshop, 1996.

[11] T. Yang and A. Gerasoulis, "A Comparison of Clustering Heuristics for Scheduling Directed Acyclic Graphs on Multiprocessors," Journal of Parallel and Distributed Computing, vol. 16, pp. 276-291, 1992.

[12] T. Yang and A. Gerasoulis, "DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors," IEEE Transactions on Parallel and Distributed Systems, vol. 5, pp. 951-967, Sept. 1994.

[13] Y. Kwok and I. Ahmad, "A Static Scheduling Algorithm Using Dynamic Critical Path for Assigning Parallel Algorithms onto Multiprocessors," Proc. of Int'l Conference on Parallel Processing, vol. II, pp. 155-159, 1994.

[14] L. Wang, H. J. Siegel, and V. P. Roychowdhury, "A Genetic-Algorithm-Based Approach for Task Matching and Scheduling in Heterogeneous Computing Environments," Proc. of Heterogeneous Computing Workshop, 1996.

[15] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, The MIT Press, 1990.

[16] Y. Chung and S. Ranka, "Applications and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed Memory Multiprocessors," Proc. of Supercomputing, pp. 512-521, Nov. 1992.

[17] M. Cosnard, M. Marrakchi, Y. Robert, and D. Trystram, "Parallel Gaussian Elimination on an MIMD Computer," Parallel Computing, vol. 6, pp. 275-295, 1988.

[18] Y. Yan, X. Zhang, and Y. Song, "An Effective Performance Prediction Model for Parallel Computing on Non-dedicated Heterogeneous NOW," Journal of Parallel and Distributed Computing, vol. 38, pp. 63-80, 1996.

Biographies

Haluk Topcuoglu is a Ph.D. candidate in the Department of Electrical Engineering and Computer Science at Syracuse University, New York, USA. His research interests include task scheduling techniques in heterogeneous environments, metacomputing and cluster computing issues, parallel and distributed programming, and web technologies. He received his B.S. and M.S. degrees in 1991 and 1993, respectively, in computer engineering from Bogazici University, Istanbul, Turkey. Mr. Topcuoglu is a member of ACM, IEEE, the IEEE Computer Society, and the Phi Beta Delta Honor Society.

Salim Hariri is currently a Professor in the Department of Electrical and Computer Engineering at The University of Arizona. Dr. Hariri received his Ph.D. in computer engineering from the University of Southern California in 1986, an M.S. degree from The Ohio State University in 1982, and a B.S. in Electrical Engineering from Damascus University in 1977. His current research focuses on high performance distributed computing, design and analysis of high speed networks, benchmarking and evaluating parallel and distributed systems, and developing software design tools for high performance computing and communication systems and applications. Dr. Hariri is the Editor-in-Chief of the Cluster Computing Journal. He served as the Program Chair and the General Chair of the IEEE International Symposium on High Performance Distributed Computing (HPDC).

Min-You Wu received the M.S. degree from the Graduate School of Academia Sinica, Beijing, China, and the Ph.D. degree from Santa Clara University, California. Before he joined the Department of Electrical and Computer Engineering, University of Central Florida, where he is currently an Associate Professor, he held various positions at the University of Illinois at Urbana-Champaign, the University of California at Irvine, Yale University, Syracuse University, and the State University of New York at Buffalo. His research interests include parallel and distributed systems, compilers for parallel computers, programming tools, VLSI design, and multimedia systems. He has published over 70 journal and conference papers in the above areas and edited two special issues on parallel operating systems. He is a member of ACM and a senior member of IEEE. He is listed in the International Who's Who of Information Technology.