Parallel Computing Unit 3 - Principles of Parallel Algorithm Design


Unit 3 – Principles of Parallel Algorithm Design
Parallel Computing

2022 Parallel Computing | UNITEN 1


Objectives
• To identify problems suitable for parallel solutions
• To differentiate the decomposition techniques in parallel computing
• To differentiate the mapping techniques in parallel computing
• To differentiate the parallel algorithm design models

2022 Parallel Computing | UNITEN 2


Parallel Algorithm
• Recipe to solve a problem using multiple processors
• Typical steps for constructing a parallel algorithm:
  • identify what pieces of work can be performed concurrently
  • partition concurrent work onto independent processors
  • distribute a program’s input, output, and intermediate data
  • coordinate accesses to shared data: avoid conflicts
  • ensure proper order of work using synchronization
• Why “typical”? Some of the steps may be omitted:
  • if data is in shared memory, distributing it may be unnecessary
  • if using message passing, there may not be shared data
  • the mapping of work to processors can be done statically by the programmer or dynamically by the runtime

2022 Parallel Computing | UNITEN 3


First Step: Understand the Problem and the Program
• Undoubtedly, the first step in developing parallel software is to understand the problem that you wish to solve in parallel. If you are starting with a serial program, this also requires understanding the existing code.
• Before spending time in an attempt to develop a parallel solution for a problem, determine whether or not
the problem is one that can actually be parallelized.
• Example of an easy-to-parallelize problem: calculate the potential energy for each of several thousand independent conformations of a molecule; when done, find the minimum-energy conformation.

• This problem can be solved in parallel. Each of the molecular conformations is independently determinable. The calculation of the minimum-energy conformation is also a parallelizable problem.
• Example of a problem with little to no parallelism: calculation of the Fibonacci series (0, 1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula F(n) = F(n-1) + F(n-2).
• The calculation of F(n) uses the values of both F(n-1) and F(n-2), which must be computed first.
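To make the contrast concrete, here is a small illustrative sketch (not from the original slides; Python, with potential_energy() as a stand-in for a real energy function): the conformations can be evaluated as independent tasks, while each Fibonacci value depends on the two before it.

from concurrent.futures import ProcessPoolExecutor
import random

def potential_energy(conformation):
    # Placeholder: any pure function of one conformation works here,
    # because each evaluation is independent of the others.
    return sum(x * x for x in conformation)

def fibonacci(n):
    # Inherently sequential: F(n) needs F(n-1) and F(n-2) first.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

if __name__ == "__main__":
    conformations = [[random.uniform(-1, 1) for _ in range(10)] for _ in range(1000)]
    # Embarrassingly parallel: every conformation is an independent task.
    with ProcessPoolExecutor() as pool:
        energies = list(pool.map(potential_energy, conformations))
    print("minimum-energy conformation index:", energies.index(min(energies)))
    print("F(20) =", fibonacci(20))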
2022 Parallel Computing | UNITEN 4
Identify Potential Areas to Parallelize
• Identify the program's hotspots:
• Know where most of the real work is being done. The majority of scientific and technical programs usually
accomplish most of their work in a few places.
• Profilers and performance analysis tools can help here
• Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage.
• Identify bottlenecks in the program:
• Are there areas that are disproportionately slow, or cause parallelizable work to halt or be deferred? For
example, I/O is usually something that slows a program down.
• May be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessary
slow areas
• Identify inhibitors to parallelism
• One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.
• Investigate other algorithms if possible
• This may be the single most important consideration when designing a parallel application.
• Take advantage of optimized third-party parallel software and highly optimized math libraries available from leading vendors (IBM's ESSL, Intel's MKL, AMD's ACML, etc.).

2022 Parallel Computing | UNITEN 5


Terms & Definitions
• The two key steps in the design of parallel algorithms:
• Dividing a computation into smaller computations and
• assigning them to different processors for parallel execution
• The process of dividing a computation into smaller parts, some or all of
which may potentially be executed in parallel, is called decomposition.
• Tasks are programmer-defined units of computation into which the main
computation is subdivided by means of decomposition.
• Simultaneous execution of multiple tasks is the key to reducing the time
required to solve the entire problem.
• The tasks into which a problem is decomposed may not all be of the same
size.

2022 Parallel Computing | UNITEN 6


Decomposing Problems to Tasks
• Divide work into tasks that can be executed concurrently
• Many different decompositions are possible for any computation
• Tasks may be of the same, different, or even indeterminate sizes
• Tasks may be independent or have a non-trivial order
• Conceptualize tasks and their ordering as a task dependency graph (a directed graph):
  • node = task
  • edge = control dependence

2022 Parallel Computing | UNITEN 7


Graphs for Task Decomposition
• A decomposition can be illustrated in the form of a directed graph
with nodes corresponding to tasks and edges indicating that the
result of one task is required for processing the next. Such a graph is
called a task dependency graph.
• The graph of tasks (nodes) and their interactions/data exchange
(edges) is referred to as a task interaction graph.
• Note that task interaction graphs represent data dependencies,
whereas task dependency graphs represent control dependencies.

2022 Parallel Computing | UNITEN 8


2022 Parallel Computing | UNITEN 9
Example: Multiplying a Dense Matrix with a Vector
(A dense matrix is one in which all or most of the elements are nonzero.)

Computation of each element of output vector y is independent of other elements. Based on this, a dense
matrix-vector product can be decomposed into n tasks. The figure highlights the portion of the matrix and
vector accessed by Task 1.

Observations: While tasks share data (namely, the vector b ), they do not have any control
dependencies - i.e., no task needs to wait for the (partial) completion of any other. All tasks are of
the same size in terms of number of operations. Is this the maximum number of tasks we could
decompose this problem into?
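As a rough illustration of this decomposition (an assumed sketch, not part of the slides), each of the n tasks below computes one element of y; all tasks read the shared vector b, but none waits on another.

from concurrent.futures import ThreadPoolExecutor

def matvec_rowwise(A, b):
    """Decompose y = A*b into one task per output element y[i]."""
    def task(i):
        # Task i reads row i of A and the whole shared vector b.
        return sum(A[i][j] * b[j] for j in range(len(b)))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task, range(len(A))))

A = [[1, 2], [3, 4]]
b = [5, 6]
print(matvec_rowwise(A, b))   # [17, 39]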
2022 Parallel Computing | UNITEN 10
Example: Database Query Processing
• Consider the execution of the query:
MODEL = "CIVIC" AND YEAR = 2001 AND
(COLOR = "GREEN" OR COLOR = "WHITE")
on the following database:
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000

• The execution of the query can be divided into subtasks in various ways.
2022 Parallel Computing | UNITEN 11
Example: Database Query Processing
• Task: compute set of elements that satisfy a predicate
• task result = table of entries that satisfy the predicate
• Edge: output of one task serves as input to the next

[Figure: task dependency graph for the query. Four leaf tasks evaluate the predicates MODEL = "CIVIC" (4523, 6734, 4395, 7352), YEAR = 2001 (7623, 9834, 6734, 5342, 3845, 4395), COLOR = "GREEN" (7623, 9834, 5342, 8354), and COLOR = "WHITE" (3476, 6734). One intermediate task combines "Civic AND 2001" (6734, 4395), another combines "Green OR White" (3476, 6734, 7623, 9834, 5342, 8354), and the final task intersects the two, yielding 6734 Civic 2001 White.]
2022 Parallel Computing | UNITEN 12


Example: Database Query Processing
• Alternate task decomposition for query
MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
[Figure: alternate task dependency graph. The same four leaf tasks evaluate MODEL = "CIVIC", YEAR = 2001, COLOR = "GREEN", and COLOR = "WHITE". Here "Green OR White" is first combined with YEAR = 2001 (7623 2001 Green, 9834 2001 Green, 6734 2001 White, 5342 2001 Green), and only then intersected with MODEL = "CIVIC", again yielding 6734 Civic 2001 White.]

Different decompositions may yield different parallelism and different amounts of work
2022 Parallel Computing | UNITEN 13
Granularity of Task Decompositions
• Granularity = task size
  • the amount of work associated with parallel tasks between synchronization/communication points
• Fine-grain = decompose into a large number of (small) tasks
• Coarse-grain = decompose into a small number of (large) tasks
• Granularity examples for dense matrix-vector multiply:
  • fine-grain: each task computes an individual element of y
  • coarser-grain: each task computes 3 elements of y

2022 Parallel Computing | UNITEN 14


Degree of Concurrency
• The number of tasks that can be executed in parallel is the degree of
concurrency of a decomposition.
• Since the number of tasks that can be executed in parallel may change over
program execution, the maximum degree of concurrency is the maximum
number of such tasks at any point during execution.
• What is the maximum degree of concurrency of the database query examples?
• The average degree of concurrency is the average number of tasks that can
be processed in parallel over the execution of the program.
• Assuming that each task in the database example takes identical processing time, what is the average degree of concurrency in each decomposition?
• The degree of concurrency increases as the decomposition becomes finer
in granularity and vice versa. Degree of concurrency vs. task granularity is
an inverse relationship.

2022 Parallel Computing | UNITEN 15


Degrees of concurrency
• Maximum degree of concurrency
• the maximum number of tasks that can be executed in parallel
• Average degree of concurrency
• the average number of tasks that can be executed in parallel
• more useful measure

2022 Parallel Computing | UNITEN 16


Critical Path Length
• A directed path in the task dependency graph represents a sequence
of tasks that must be processed one after the other.
• The longest such path determines the shortest time in which the
program can be executed in parallel.
• The length of the longest path in a task dependency graph is called
the critical path length.

2022 Parallel Computing | UNITEN 17


Critical Path
• Start nodes: nodes with no incoming edges
• Finish nodes: nodes with no outgoing edges
• Critical path: the longest directed path between any pair of start and
finish nodes
• Critical path length: sum of the weights of the nodes on a critical path
• Average degree of concurrency = total amount of work / critical path length
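A small sketch of how these two quantities could be computed from a task dependency graph. The weights below are taken from the query decomposition on the next slide; the edge list is assumed from that figure.

def critical_path_length(weights, edges):
    """weights: {task: work}; edges: list of (u, v) meaning u must finish before v."""
    succ = {t: [] for t in weights}
    for u, v in edges:
        succ[u].append(v)
    memo = {}
    def longest_from(t):
        # Longest weighted path starting at task t.
        if t not in memo:
            memo[t] = weights[t] + max((longest_from(v) for v in succ[t]), default=0)
        return memo[t]
    return max(longest_from(t) for t in weights)

# Task weights from the first database-query decomposition; edges assumed from the figure.
weights = {1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 6, 7: 8}
edges = [(1, 5), (2, 5), (3, 6), (4, 6), (5, 7), (6, 7)]
cp = critical_path_length(weights, edges)
print("critical path length:", cp)                                     # 28
print("average degree of concurrency:", sum(weights.values()) / cp)    # 64/28 ≈ 2.29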

2022 Parallel Computing | UNITEN 18


MODEL = "CIVIC" AND YEAR = 2001 AND

Critical Path Length (COLOR = "GREEN" OR COLOR = "WHITE")

Decomposition → Task Dependency Graph


Task 1 Task 2 Task 3 Task 4
10 10 10 10

Task 5 10 Task 6 6

Task 7 8

• Start node: Task 1, Task 2, Task 3, Task 4 • Average degree of concurrency = (10+10+10+10+10+6+8)/28
= 64/28 = 2.29
• Finish node: Task 7
• Maximum degree of concurrency = 4 (Task 1, Task 2, Task 3,
• Critical path: Task 1, Task 5, Task 7 and Task 2, Task 5, Task 7 Task 4)
• Critical path length: 10+10+8 = 28
2022 Parallel Computing | UNITEN 19
MODEL = "CIVIC" AND YEAR = 2001 AND
Critical Path Length (COLOR = "GREEN" OR COLOR = "WHITE")

Decomposition → Task Dependency Graph


Task 1 Task 2 Task 3 Task 4
10 10 10 10

Task 5 6

Task 6 12

Task 7 8

• Start node: Task 1, Task 2, Task 3, Task 4 • Average degree of concurrency = (10+10+10+10+6+12+8)/36 = 66/36 =
1.83
• Finish node: Task 7
• Maximum degree of concurrency = 4 (Task 1, Task 2, Task 3, Task 4)
• Critical path: Task 3, Task 5, Task 6, Task 7 and Task 3, Task 5, Task 6, Task 7
• Critical path length: 10+6+12+8 = 36

2022 Parallel Computing | UNITEN 20


Critical Path Length: y = A × b
Decomposition → Task Dependency Graph

[Task dependency graph: Task 1 (weight 4) and Task 2 (weight 4) feed Task 3 (weight 8), which produces y.]

y1 = A1,1 b1 + A1,2 b2
y2 = A2,1 b1 + A2,2 b2

• Start nodes: Task 1, Task 2
• Finish node: Task 3
• Critical path: Task 1, Task 3 and Task 2, Task 3
• Critical path length: 4 + 8 = 12
• Average degree of concurrency = (4+4+8)/12 = 16/12 ≈ 1.33
• Maximum degree of concurrency = 2 (Task 1, Task 2)
2022 Parallel Computing | UNITEN 21
Limits on Parallel Performance
• It would appear that the parallel time can be made arbitrarily small by
making the decomposition finer in granularity.
• There is an inherent bound on how fine the granularity of a computation
can be.
• For example, in the case of multiplying a dense matrix with a vector, there can be no more than n² concurrent tasks.
• Concurrent tasks may also have to exchange data with other tasks. This
results in communication overhead.
• The tradeoff between the granularity of a decomposition and associated
overheads often determines performance bounds.
• Fraction of application work that can’t be parallelized –as shown by
Amdahl’s law, also limits parallel performance

2022 Parallel Computing | UNITEN 22


Task Interaction Graphs
• Tasks generally exchange data with others
• example: dense matrix-vector multiply
• if vector b is not replicated in all tasks, tasks will have to communicate
elements of b
• The graph of tasks (nodes) and their interactions/data exchange
(edges) is referred to as a task interaction graph.
• Note that task interaction graphs represent data dependencies,
whereas task dependency graphs represent control dependencies.
• In task interaction graph
• node = task
• edge = interaction or data exchange

2022 Parallel Computing | UNITEN 23


Task Interaction Graph Example
Sparse matrix-vector multiplication
• Computation of each result element = an independent task
• Only the non-zero elements of the sparse matrix A participate
• If b is partitioned among tasks:
  • structure of the task interaction graph = graph of the matrix A
  • (i.e., the graph for which A represents the adjacency structure)

2022 Parallel Computing | UNITEN 24


Task Interaction Graphs, Granularity, and
Communication
• In general, if the granularity of a decomposition is finer, the associated
overhead (as a ratio of useful work associated with a task) increases.
• Example: Consider the sparse matrix-vector product example from
previous slide. Assume that each node takes one unit time to process and
each interaction (edge) causes an overhead of one unit time.
• Viewing node 0 as an independent task involves a useful computation of
one time unit and overhead (communication) of three time units.
• Now, if we consider nodes 0, 4, and 5 as one task, then the task has useful
computation totaling to three time units and communication
corresponding to four time units (four edges). Clearly, this is a more
favorable ratio than the former case.

2022 Parallel Computing | UNITEN 25


Interaction Graphs, Granularity, &
Communication
• Finer task granularity increases communication overhead
• Example: sparse matrix-vector product interaction graph
• Assumptions:
• each node takes unit time to process
• each interaction (edge) causes an overhead of a unit time
• If node 0 is a task:
  • communication = 3
  • computation = 4 (nonzero entries 0-0, 0-1, 0-4, 0-8)
• If nodes 0, 4, and 5 are one task:
  • communication = 5
  • computation = 15
    • row 0: 0-0, 0-1, 0-4, 0-8 = 4
    • row 4: 4-4, 4-0, 4-5, 4-9, 4-8 = 5
    • row 5: 5-5, 5-1, 5-2, 5-6, 5-9, 5-4 = 6
• coarser-grain decomposition → smaller communication/computation ratio

2022 Parallel Computing | UNITEN 26


Decomposition
techniques

2022 Parallel Computing | UNITEN 27


Decomposition Techniques
How should one decompose a task into various subtasks?
• No single universal recipe
• In practice, a variety of techniques are used including
• recursive decomposition
• data decomposition
• exploratory decomposition
• speculative decomposition

2022 Parallel Computing | UNITEN 28


Decomposition Techniques
Recursive Decomposition
• Suitable for problems solvable
using divide-and-conquer
• Steps
• decompose a problem into a set of
sub-problems
• recursively decompose each sub-
problem
• stop decomposition when
minimum desired granularity
reached

2022 Parallel Computing | UNITEN 29


Decomposition Techniques
Recursive Decomposition for Quicksort
• A classic example of a divide-
and-conquer algorithm on which
we can apply recursive
decomposition is Quicksort.
• In this example, once the list has
been partitioned around the
pivot, each sublist can be
processed concurrently (i.e.,
each sublist represents an
independent subtask). This can
be repeated recursively.

2022 Parallel Computing | UNITEN 30


Decomposition Techniques
Recursive Decomposition for Quicksort
• Sort a vector v:
  • Select a pivot (e.g., the element at the first index)
  • Partition v around the pivot into vleft and vright
  • In parallel, sort vleft and sort vright (see the sketch below)
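A minimal sketch of this recursive decomposition (Python threads for illustration; the depth cap and worker count are arbitrary choices, and a production sort would use a proper parallel runtime).

from concurrent.futures import ThreadPoolExecutor

def parallel_quicksort(v, pool, depth=0, max_depth=2):
    if len(v) <= 1:
        return v
    pivot = v[0]                                   # choose the first element as pivot
    vleft  = [x for x in v[1:] if x < pivot]       # partition v around the pivot
    vright = [x for x in v[1:] if x >= pivot]
    if depth < max_depth:
        # Recursive decomposition: the two sublists are independent subtasks.
        fl = pool.submit(parallel_quicksort, vleft,  pool, depth + 1, max_depth)
        fr = pool.submit(parallel_quicksort, vright, pool, depth + 1, max_depth)
        return fl.result() + [pivot] + fr.result()
    # Below the depth cap, fall back to serial recursion.
    return parallel_quicksort(vleft, pool, depth + 1) + [pivot] + \
           parallel_quicksort(vright, pool, depth + 1)

with ThreadPoolExecutor(max_workers=8) as pool:
    print(parallel_quicksort([5, 3, 8, 1, 9, 2, 7], pool))   # [1, 2, 3, 5, 7, 8, 9]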

2022 Parallel Computing | UNITEN 31


Decomposition Techniques
Recursive Decomposition to find minimum number
• The problem of finding the minimum number in a given list (or indeed any other associative operation such as sum, AND, etc.) can be fashioned as a divide-and-conquer algorithm. The following algorithm illustrates this.
• We first start with a simple serial loop for computing the minimum entry in a given list:

procedure SERIAL_MIN (A, n)
begin
  min := A[0];
  for i := 1 to n−1 do
    if (A[i] < min) min := A[i];
  endfor;
  return min;
end SERIAL_MIN

2022 Parallel Computing | UNITEN 32


Decomposition Techniques
Recursive Decomposition to find minimum number
• We can rewrite the loop as follows:

procedure RECURSIVE_MIN (A, n)
begin
  if (n = 1) then
    min := A[0];
  else
    lmin := RECURSIVE_MIN (A, n/2);
    rmin := RECURSIVE_MIN (&(A[n/2]), n − n/2);
    if (lmin < rmin) then
      min := lmin;
    else
      min := rmin;
    endelse;
  endelse;
  return min;
end RECURSIVE_MIN
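A hedged Python rendering of the same idea: the sublists (here, one chunk per worker) are independent tasks, and combining the partial minima is itself a min-reduction.

from concurrent.futures import ProcessPoolExecutor

def recursive_min(A):
    """Serial divide-and-conquer minimum, mirroring RECURSIVE_MIN."""
    n = len(A)
    if n == 1:
        return A[0]
    lmin = recursive_min(A[: n // 2])
    rmin = recursive_min(A[n // 2 :])
    return lmin if lmin < rmin else rmin

def parallel_min(A, workers=4):
    # Split into one chunk per worker; each chunk is an independent task,
    # then combine the partial results (also a min-reduction).
    chunk = max(1, len(A) // workers)
    parts = [A[i : i + chunk] for i in range(0, len(A), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return min(pool.map(recursive_min, parts))

if __name__ == "__main__":
    print(parallel_min([4, 9, 1, 7, 8, 11, 2, 12]))   # 1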

2022 Parallel Computing | UNITEN 33


Decomposition Techniques
Recursive Decomposition to find minimum number
• The code in the previous slide can be decomposed naturally using a
recursive decomposition strategy. We illustrate this with the following
example of finding the minimum number in the set {4, 9, 1, 7, 8, 11, 2,
12}. The task dependency graph associated with this computation is as
follows:

2022 Parallel Computing | UNITEN 34


Decomposition Techniques
Data Decomposition
• Steps
1. identify the data on which computations are performed
2. partition the data across various tasks
• partitioning induces a decomposition of the problem
• Data can be partitioned in various ways
• appropriate partitioning is critical to parallel performance
• Decomposition based on
• input data
• output data
• input + output data
• intermediate data

2022 Parallel Computing | UNITEN 35


Decomposition Techniques
Data Decomposition: Based on Input Data
• Applicable if each output is computed as a function of the input
• May be the only natural decomposition if output is unknown
• examples
• finding the minimum in a set or other reductions
• sorting a vector
• Associate a task with each input data partition
• task performs computation on its part of the data
• subsequent processing combines partial results from earlier tasks

2022 Parallel Computing | UNITEN 36


Decomposition Techniques
Data Decomposition: Based on
Input Data
• The problem of computing the
frequency of a set of itemsets in a
transaction database
• Decomposition based on a
partitioning of the input set of
transactions.
• Each of the two tasks computes the
frequencies of all the itemsets in its
respective subset of transactions.
• The two sets of frequencies, which
are the independent outputs of the
two tasks, represent intermediate
results.
• Combining the intermediate results
by pairwise addition yields the final
result.

2022 Parallel Computing | UNITEN 37


Decomposition Techniques
Data Decomposition: Based on Input Data
• Count the frequency of item sets in database transactions

• Partition computation by partitioning the set of transactions


• a task computes a local count for each item set for its transactions

• sum local count vectors for item sets to produce total count vector
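An illustrative sketch of this input-data decomposition (the itemsets and transactions are made up for the example): each task counts itemsets over its own partition of the transactions, and the partial count vectors are then summed.

from concurrent.futures import ProcessPoolExecutor
from collections import Counter

ITEMSETS = [frozenset(s) for s in ({"A", "B"}, {"C"}, {"A", "D"})]

def local_counts(transactions):
    """One task: count every itemset within its own partition of transactions."""
    counts = Counter()
    for t in transactions:
        for s in ITEMSETS:
            if s <= t:
                counts[s] += 1
    return counts

def itemset_frequencies(transactions, ntasks=2):
    chunk = (len(transactions) + ntasks - 1) // ntasks
    parts = [transactions[i : i + chunk] for i in range(0, len(transactions), chunk)]
    with ProcessPoolExecutor(max_workers=ntasks) as pool:
        total = Counter()
        for partial in pool.map(local_counts, parts):
            total.update(partial)          # combine the intermediate results
        return total

if __name__ == "__main__":
    db = [frozenset(t) for t in ({"A", "B", "C"}, {"A", "D"}, {"B", "C"}, {"A", "B", "D"})]
    print(itemset_frequencies(db))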

2022 Parallel Computing | UNITEN 38


Decomposition Techniques
Data Decomposition: Based on Output Data
• Often, each element of the output can be computed independently of others (but
simply as a function of the input).
• Partition the output data across tasks
• Have each task perform the computation for its outputs

2022 Parallel Computing | UNITEN 39


Decomposition Techniques
Data Decomposition: Based on Output Data -
Example
• Matrix multiplication: C = A x B
• Computation of C can be partitioned into four tasks

• Other task decompositions possible
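A sketch of the output-data decomposition (one task per element of C here; with 2×2 blocks of submatrices the structure is identical).

from concurrent.futures import ThreadPoolExecutor

def matmul_output_partitioned(A, B):
    """C = A x B with one task per output element (block) of C."""
    n = len(A)
    def task(ij):
        i, j = ij
        # Task (i, j) owns C[i][j] and computes it from row i of A and column j of B.
        return sum(A[i][k] * B[k][j] for k in range(n))
    with ThreadPoolExecutor() as pool:
        flat = list(pool.map(task, [(i, j) for i in range(n) for j in range(n)]))
    return [flat[i * n : (i + 1) * n] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_output_partitioned(A, B))   # [[19, 22], [43, 50]]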

2022 Parallel Computing | UNITEN 40


Decomposition Techniques
Data Decomposition: Based on Output Data -
Example
• A partitioning of output data does not result in a unique decomposition into tasks. For example, for the same problem as in the previous foil, with identical output data distribution, we can derive the following two (other) decompositions:

Decomposition I:
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2
Task 4: C1,2 = C1,2 + A1,2 B2,2
Task 5: C2,1 = A2,1 B1,1
Task 6: C2,1 = C2,1 + A2,2 B2,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II:
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

2022 Parallel Computing | UNITEN 41


Decomposition Techniques
Data Decomposition: Based on Output Data -
Example
• Count the frequency of item sets
in database transactions
• Partition computation by
partitioning the item sets to
count
• each task computes total count for
each of its item sets
• append total counts for item
subsets to produce result

2022 Parallel Computing | UNITEN 42


Decomposition Techniques
Data Decomposition: Based on Output Data -
Example
From the previous example, the following observations can be made:

• If the database of transactions is replicated across the processes, each task can be independently accomplished with no communication.
• If the database is partitioned across processes as well (for reasons of
memory utilization), each task first computes partial counts. These
counts are then aggregated at the appropriate task.

2022 Parallel Computing | UNITEN 43


Decomposition Techniques
Data Decomposition: Partitioning Input and
Output Data
• Partition on both input and
output for more concurrency
• Example: itemset counting

2022 Parallel Computing | UNITEN 44


Decomposition Techniques
Data Decomposition: Intermediate Data
Partitioning
• If computation is a sequence of transforms
• (from input data to output data)
• Can decompose based on data for intermediate stages

2022 Parallel Computing | UNITEN 45


Decomposition Techniques
Data Decomposition: Intermediate Data Partitioning:
Example
• Let us revisit the example of
dense matrix multiplication. We
first show how we can visualize
this computation in terms of
intermediate matrices D.

2022 Parallel Computing | UNITEN 46


Decomposition Techniques
Data Decomposition: Intermediate Data Partitioning:
Example
• A decomposition of the intermediate data structure leads to the following decomposition into 8 + 4 tasks:

Stage I:
Task 01: D1,1,1 = A1,1 B1,1    Task 02: D2,1,1 = A1,2 B2,1
Task 03: D1,1,2 = A1,1 B1,2    Task 04: D2,1,2 = A1,2 B2,2
Task 05: D1,2,1 = A2,1 B1,1    Task 06: D2,2,1 = A2,2 B2,1
Task 07: D1,2,2 = A2,1 B1,2    Task 08: D2,2,2 = A2,2 B2,2

Stage II:
Task 09: C1,1 = D1,1,1 + D2,1,1    Task 10: C1,2 = D1,1,2 + D2,1,2
Task 11: C2,1 = D1,2,1 + D2,2,1    Task 12: C2,2 = D1,2,2 + D2,2,2

2022 Parallel Computing | UNITEN 47
Decomposition Techniques
Data Decomposition: Intermediate Data
Partitioning: Example

2022 Parallel Computing | UNITEN 48


Decomposition Techniques
Data Decomposition: Intermediate Data
Partitioning - Owner Computes Rule
• Each piece of data is assigned to a process/thread (its owner)
• Each thread computes the values associated with its data
• Implications:
  • input data decomposition: all computations that use an input datum are performed by its thread
  • output data decomposition: an output is computed by the thread assigned to that output datum

2022 Parallel Computing | UNITEN 49


Decomposition Techniques
Exploratory Decomposition
• Is used to decompose problems whose underlying computations
correspond to a search of a space for solutions.
• Exploration (search) of a state space of solutions
• problem decomposition reflects shape of execution
• Examples
• discrete optimization
• theorem proving
• game playing

2022 Parallel Computing | UNITEN 50


Decomposition Techniques
Exploratory Decomposition: Example
Solving a 15-puzzle
• Sequence of three moves from state (a) to the final state (d)
• From an arbitrary state, we must search for a solution


2022 Parallel Computing | UNITEN 51
Decomposition Techniques
Exploratory Decomposition: Example
Search
— generate successor states of the current state
— explore each as an independent task
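A hedged sketch of this search step (the board encoding and the depth-limited search are illustrative choices, not the slides' algorithm): generate the successors of the current state and explore each one as an independent task.

from concurrent.futures import ProcessPoolExecutor
import random

GOAL = tuple(range(1, 16)) + (0,)          # 0 marks the blank tile

def successors(state):
    """All states reachable by sliding one tile into the blank."""
    i = state.index(0)
    r, c = divmod(i, 4)
    out = []
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 4 and 0 <= nc < 4:
            j = nr * 4 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            out.append(tuple(s))
    return out

def depth_limited_solve(state, depth=6):
    """One exploratory task: search this successor's subtree for the goal."""
    if state == GOAL:
        return True
    if depth == 0:
        return False
    return any(depth_limited_solve(s, depth - 1) for s in successors(state))

if __name__ == "__main__":
    random.seed(0)
    start = GOAL
    for _ in range(3):                      # scramble the goal with a few legal moves
        start = random.choice(successors(start))
    # Exploratory decomposition: each successor of the current state is an independent task.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(depth_limited_solve, successors(start)))
    print("solution found in some subtree:", any(results))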

2022 Parallel Computing | UNITEN 52


Decomposition Techniques
Exploratory Decomposition: Speedup
• The parallel formulation may perform a different amount of work than the serial one; it can be either smaller or greater.
• In one case, total serial work = 2m + 1 while total parallel work = 1; in another, total serial work = m while total parallel work = 4m.

2022 Parallel Computing | UNITEN 53


Decomposition Techniques
Speculative Decomposition
• In some applications, dependencies between tasks are not known a priori.
• For such applications, it is impossible to identify independent tasks up front.
• There are generally two approaches to dealing with such applications:
  • conservative approaches - identify independent tasks only when they are guaranteed to have no dependencies
  • optimistic approaches - schedule tasks even when they may potentially be erroneous
• Conservative approaches may yield little concurrency, and optimistic approaches may require a roll-back mechanism in the case of an error.

2022 Parallel Computing | UNITEN 54


Decomposition Techniques
Speculative Decomposition : Example
A classic example of speculative decomposition is in discrete event simulation.
• The central data structure in a discrete event simulation is a time-ordered event
list.
• Events are extracted precisely in time order, processed, and if required, resulting
events are inserted back into the event list.
• Consider your day today as a discrete event system - you get up, get ready, drive
to work, work, eat lunch, work some more, drive back, eat dinner, and sleep.
• Each of these events may be processed independently, however, in driving to
work, you might meet with an unfortunate accident and not get to work at all.
• Therefore, an optimistic scheduling of other events will have to be rolled back.

2022 Parallel Computing | UNITEN 55


Decomposition Techniques
Speculative Decomposition : Example
• Another example is the simulation
of a network of nodes (for
instance, an assembly line or a
computer network through which
packets pass). The task is to
simulate the behavior of this
network for various inputs and
node delay parameters (note that
networks may become unstable
for certain values of service rates,
queue sizes, etc.).

2022 Parallel Computing | UNITEN 56


Decomposition Techniques
Exploratory vs Speculative
• Speculative decomposition is different from exploratory decomposition in the
following way.
• In speculative decomposition, the input at a branch leading to multiple parallel tasks is
unknown,
• whereas in exploratory decomposition, the output of the multiple tasks originating at a
branch is unknown.
• A parallel program employing speculative decomposition may perform more aggregate work than its serial counterpart.
• On the other hand, in exploratory decomposition, the serial algorithm too may
explore different alternatives one after the other, because the branch that may
lead to the solution is not known beforehand.
• Therefore, the parallel program may perform more, less, or the same amount of
aggregate work compared to the serial algorithm depending on the location of
the solution in the search space.

2022 Parallel Computing | UNITEN 57


Decomposition Techniques
Hybrid Decompositions
Often, a mix of decomposition techniques is necessary for decomposing a problem. Consider the
following examples:
• In quicksort, recursive decomposition alone limits concurrency (Why?). A mix of data and recursive
decompositions is more desirable.
• In discrete event simulation, there might be concurrency in task processing. A mix of speculative
decomposition and data decomposition may work well.
• Even for simple problems like finding a minimum of a list of numbers, a mix of data and recursive
decomposition works well.

Hybrid decomposition for finding the minimum of an array of size 16 using four tasks
2022 Parallel Computing | UNITEN 58
Process and Mapping

2022 Parallel Computing | UNITEN 59


Processes and Mapping
• In general, the number of tasks in a decomposition exceeds the number of
processing elements available.

• For this reason, a parallel algorithm must also provide a mapping of tasks
to processes.

• Note: We refer to the mapping as being from tasks to processes, as opposed to processors. This is because typical programming APIs, as we shall see, do not allow easy binding of tasks to physical processors. Rather, we aggregate tasks into processes and rely on the system to map these processes to physical processors. We use "process" not in the UNIX sense, but simply as a collection of tasks and associated data.

2022 Parallel Computing | UNITEN 60


Processes and Mapping
• Appropriate mapping of tasks to processes is critical to the parallel
performance of an algorithm.
• Mappings are determined by both the task dependency and task
interaction graphs.
• Task dependency graphs can be used to ensure that work is equally
spread across all processes at any point (minimum idling and optimal
load balance).
• Task interaction graphs can be used to make sure that processes need
minimum interaction with other processes (minimum
communication).

2022 Parallel Computing | UNITEN 61


Processes and Mapping
An appropriate mapping must minimize parallel execution time by:

• Mapping independent tasks to different processes.
• Assigning tasks on the critical path to processes as soon as they become available.
• Minimizing interaction between processes by mapping tasks with dense interactions to the same process.

Note: These criteria often conflict with each other. For example, a decomposition into one task (or no decomposition at all) minimizes interaction but does not result in a speedup at all! Can you think of other such conflicting cases?

2022 Parallel Computing | UNITEN 62


Processes and Mapping: Example
[Figure: mapping the two database query decompositions onto four processes.
Decomposition 1: Tasks 1-4 (10 each) → P0, P1, P2, P3; Task 5 (10) → P0; Task 6 (6) → P1; Task 7 (8) → P0.
Decomposition 2: Tasks 1-4 (10 each) → P0, P1, P2, P3; Task 5 (6) → P0; Task 6 (12) → P0; Task 7 (8) → P0.]

Mapping tasks in the database query decompositions to processes. These mappings were arrived at by viewing the dependency graph in terms of levels (no two nodes in a level have dependencies). Tasks within a single level are then assigned to different processes.
2022 Parallel Computing | UNITEN 63
Mapping
• Finding a balance that optimizes the overall parallel performance is
the key to a successful parallel algorithm.
• Therefore, mapping of tasks onto processes plays an important role in
determining how efficient the resulting parallel algorithm is.
• Even though the degree of concurrency is determined by the
decomposition, it is the mapping that determines how much of that
concurrency is actually utilized, and how efficiently.

2022 Parallel Computing | UNITEN 64


Mapping Techniques
• Once a problem has been decomposed into concurrent tasks, these
must be mapped to processes (that can be executed on a parallel
platform).
• Mappings must minimize overheads.
• Primary overheads are communication and idling.
• Minimizing these overheads often represents conflicting objectives: minimizing one increases the other.
  • Assigning all work to one processor trivially minimizes communication at the expense of significant idling; conversely, minimizing serialization (idling) introduces communication.

2022 Parallel Computing | UNITEN 65


Mapping Techniques for Minimum Idling
• Must simultaneously minimize idling and balance the load
• Balancing the load alone does not minimize idling
• (Figure: same workload on each processor but different completion times)

2022 Parallel Computing | UNITEN 66


Mapping Techniques for Minimum Idling
• Mapping techniques can be static or dynamic
• Static Mapping: Tasks are mapped to processes a priori. For this to work, we must have a good estimate of the size of each task. Even then, computing an optimal mapping may still be hard.
• Dynamic Mapping: Tasks are mapped to processes at runtime. This may be because the tasks are generated at runtime, or because their sizes are not known in advance.
• Other factors that determine the choice of techniques include the
size of data associated with a task and the nature of underlying
domain.

2022 Parallel Computing | UNITEN 67


Schemes for Static Mapping
• Data partitioning
• Task graph partitioning
• Hybrid strategies

2022 Parallel Computing | UNITEN 68


Mappings Based on Data Partitioning
• Partition computation using a combination of
• data partitioning
• owner-computes rule
• The simplest data decomposition schemes for dense matrices are 1-D
block distribution schemes.
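A small sketch of a 1-D block distribution (the ceiling-based split below is one common convention, not the only one): each process owns a contiguous block of rows and, by the owner-computes rule, computes the outputs for those rows.

def block_1d(n, p, rank):
    """Rows of an n-row matrix owned by process `rank` out of `p` (1-D block distribution)."""
    block = (n + p - 1) // p           # ceiling division: block size per process
    lo = rank * block
    hi = min(n, lo + block)
    return range(lo, hi)

# Owner-computes: each process computes the y[i] for the rows it owns.
n, p = 10, 4
for rank in range(p):
    print(f"process {rank} owns rows {list(block_1d(n, p, rank))}")
# process 0 owns [0, 1, 2]; process 1 owns [3, 4, 5]; process 2 owns [6, 7, 8]; process 3 owns [9]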

2022 Parallel Computing | UNITEN 69


Block Array Distribution Schemes
• Multi-dimensional block distributions

• Multi-dimensional partitioning enables larger # of processes

2022 Parallel Computing | UNITEN 70


Block Array Distribution Example
• For multiplying two dense matrices C = A x B, we can partition the
output matrix C using a block decomposition.
• For load balance, we give each task the same number of elements of
C. (Note that each element of C corresponds to a single dot product.)
• The choice of precise decomposition (1-D or 2-D) is determined by
the associated communication overhead.
• In general, a higher-dimensional decomposition allows the use of a larger number of processes.
• Select the decomposition that minimizes the associated communication overhead.

2022 Parallel Computing | UNITEN 71


Data Usage in Dense Matrix
Multiplication

Data sharing needed for matrix multiplication with (a) one-dimensional and (b) two-dimensional partitioning
of the output matrix. Shaded portions of the input matrices A and B are required by the process that
computes the shaded portion of the output matrix C.

2022 Parallel Computing | UNITEN 72


Other Mapping Techniques
• Cyclic and Block Cyclic Distributions

• Graph Partitioning

• Mappings Based on Task Partitioning

2022 Parallel Computing | UNITEN 73


Parallel Algorithm
Design Models

2022 Parallel Computing | UNITEN 74


Parallel Algorithm Models
An algorithm model is a way of structuring a parallel algorithm by
selecting a decomposition and mapping technique and applying the
appropriate strategy to minimize interactions.

• Data Parallel Model: Tasks are statically (or semi-statically) mapped


to processes and each task performs similar operations on different
data.
• Task Graph Model: Starting from a task dependency graph, the
interrelationships among the tasks are utilized to promote locality or
to reduce interaction costs.

2022 Parallel Computing | UNITEN 75


Parallel Algorithm Models (continued)
• Master-Slave Model: One or more master processes generate work and allocate it to worker processes. This allocation may be static or dynamic (a sketch follows below).
• Pipeline / Producer-Consumer Model: A stream of data is passed through a succession of processes, each of which performs some task on it.
• Hybrid Models: A hybrid model may be composed either of multiple
models applied hierarchically or multiple models applied sequentially
to different phases of a parallel algorithm.
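As a minimal sketch of the Master-Slave model above (Python's multiprocessing pool stands in for the workers; the work items and the square() job are illustrative only):

from multiprocessing import Pool

def square(work_item):
    """Worker: process one unit of work handed out by the master."""
    return work_item * work_item

if __name__ == "__main__":
    work = list(range(20))                     # the master generates the work
    with Pool(processes=4) as workers:
        # Dynamic allocation: chunksize=1 hands out one item at a time,
        # so faster workers automatically pick up more work.
        results = workers.map(square, work, chunksize=1)
    print(results)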

2022 Parallel Computing | UNITEN 76


References
• Adapted from slides “Principles of Parallel Algorithm Design” by
Ananth Grama
• Based on Chapter 3 of “Introduction to Parallel Computing” by
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar.
Addison Wesley, 2003
• http://www-users.cs.umn.edu/~karypis/parbook/

2022 Parallel Computing | UNITEN 77


2022 Universiti Tenaga Nasional. All rights reserved.

2022 Parallel Computing | UNITEN 78
