Parallel and Distributed Algorithms: Johnnie W. Baker
Algorithms
Spring 2007
Johnnie W. Baker
Chapter 1
References
Outline
Weather Prediction
Atmosphere is divided into 3D cells
Data includes temperature, pressure, humidity,
wind speed and direction, etc
Recorded at regular time intervals in each cell
There are about 5×10³ cells of 1-mile cubes.
Calculations needed for a 10-day forecast would
take a modern computer over 100 days to perform.
Details in Ian Foster's 1995 online textbook
Designing and Building Parallel Programs
See the author's online copy for further information
Flynn's Taxonomy
Best-known classification scheme for parallel
computers.
Depends on the parallelism a computer exhibits in its
Instruction stream
Data stream
A sequence of instructions (the instruction stream)
manipulates a sequence of operands (the data
stream)
SISD
Single Instruction, Single Data
Single-CPU systems
i.e., uniprocessors
SIMD
Single instruction, multiple data
One instruction stream is broadcast to all
processors
Each processor (also called a processing
element or PE) is very simple and is
essentially an ALU.
PEs do not store a copy of the program nor
have a program control unit.
Individual processors can be inhibited from
participating in an instruction (based on a
data test).
SIMD (cont.)
All active processors execute the same
instruction synchronously, but on different
data.
On a memory access, all active
processors must access the same location
in their local memory.
The data items form an array and an
instruction can act on the complete array
in one cycle.
SIMD (cont.)
Quinn calls this architecture a processor
array
Two examples are Thinking Machines'
Connection Machine CM-2 and
Connection Machine CM-200
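As a loose, illustrative analogy (not from the slides), a vectorized NumPy operation mimics the processor-array model: one operation is applied to every element of a data array at once, and a mask plays the role of PEs inhibited by a data test. The numpy package is assumed to be available.

```python
import numpy as np

# One "instruction" applied to an entire data array, SIMD-style.
data = np.array([3, 1, 4, 1, 5, 9, 2, 6])
result = data * 2                 # every element is processed in lockstep

# A data test can inhibit some "PEs": only elements greater than 4 participate.
mask = data > 4
result[mask] = data[mask] + 100   # masked-out elements keep their previous value
print(result)                     # [  6   2   8   2 105 109   4 106]
```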
MISD
Multiple instruction, single data
Quinn argues that a systolic array is an
example of an MISD structure (pp. 55-57)
Some authors include pipelined
architecture in this category
This category does not receive much
attention from most authors
MIMD
Multiple instruction, multiple data
Processors are asynchronous, since they
can independently execute different
programs on different data sets.
Communication is handled either by use
of message passing or through shared
memory.
Considered by most researchers to
contain the most powerful, least restricted
computers.
MIMD (cont.)
MIMDs have major communication costs that are
usually ignored when comparing them to SIMDs and
when computing algorithmic complexity.
Multiprocessors
(Shared Memory MIMDs)
All processors have access to all memory locations.
The processors access memory through some
type of interconnection network.
When every processor has the same access time to every
memory location, this is called uniform memory
access (UMA); otherwise access is non-uniform (NUMA).
Multiprocessors (cont.)
Normally, fast cache is used with NUMA
systems to reduce the problem of different
memory access times for PEs.
This creates the cache-coherence problem of ensuring that
all copies of the same data in different memory locations
are identical.
Multicomputers
(Message-Passing MIMDs)
Processors are connected by an interconnection network
Each processor has a local memory and can only access
its own local memory.
Data is passed between processors using messages, as
dictated by the program.
A common approach to programming multicomputers is
to use a message-passing library (e.g., MPI, PVM)
The problem is divided into processes that can be
executed concurrently on individual processors. Each
processor is normally assigned multiple processes.
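A minimal message-passing sketch in Python, assuming the mpi4py package and an MPI runtime are installed (the slides mention MPI and PVM; mpi4py is just one binding). Each process owns its local data and combines results only through messages, e.g., run with mpiexec -n 2 python sum_msg.py (file name hypothetical).

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # each process has its own rank and local memory

if rank == 0:
    comm.send([1, 2, 3, 4], dest=1, tag=0)   # hand a chunk of work to process 1
    partial = sum([5, 6, 7, 8])              # work on process 0's own local data
    partial += comm.recv(source=1, tag=1)    # combine results via a message
    print("total =", partial)                # total = 36
elif rank == 1:
    chunk = comm.recv(source=0, tag=0)       # no shared memory: data arrives by message
    comm.send(sum(chunk), dest=0, tag=1)
```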
Functional/Control/Job Parallelism
Involves executing different sets of
operations on different data sets.
Typical MIMD control parallelism
Problem is divided into different non-identical
tasks
Tasks are distributed among the processors
Tasks are usually divided between processors so
that their workload is roughly balanced
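As a small illustration (not part of the slides), Python's standard concurrent.futures module can express this kind of control parallelism: two non-identical tasks run concurrently, each on its own data set.

```python
from concurrent.futures import ProcessPoolExecutor

def summarize(values):
    """One task: compute the mean of a numeric data set."""
    return sum(values) / len(values)

def longest_word(text):
    """A different, non-identical task on a different data set."""
    return max(text.split(), key=len)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(summarize, [2.0, 4.0, 6.0])
        f2 = pool.submit(longest_word, "multiple instruction multiple data")
        print(f1.result(), f2.result())      # 4.0 instruction
```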
Grain Size
Defn: Grain Size is the average number of
computations performed between
communication or synchronization steps
See Quinn textbook, page 411
No shared memory
Memory distributed equally between PEs
PEs communicate through binary tree network.
Each PE stores a copy of the program.
PEs execute different instructions.
Normally synchronous, but asynchronous operation is also
possible.
Only leaf and root PEs have I/O capabilities.
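The slide above describes a binary-tree machine. As an illustrative sketch (not from the slides), the usual array encoding of a complete binary tree shows how a PE would locate its neighbors in such a network.

```python
# Number the PEs 1..2n-1 in a complete binary tree with n leaves (n a power of 2).
def parent(i):
    return i // 2 if i > 1 else None          # PE 1 is the root

def children(i, total_pes):
    left, right = 2 * i, 2 * i + 1
    return [c for c in (left, right) if c <= total_pes]   # leaf PEs have no children

n_leaves = 8
total = 2 * n_leaves - 1                       # 15 PEs in all
print(parent(6), children(6, total))           # 3 [12, 13]
print(children(8, total))                      # [] -- PE 8 is a leaf
```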
Average Case
Used occasionally in synchronous computation in the
same manner as in sequential computation
Usual default for asynchronous computation
Uniform Analysis
Charge constant time for memory access
Unrealistic, as shown for the PRAM in Chapter 2 of Akl's book
Non-Uniform Analysis
Charge O(lg M) cost for a memory access, where M is the
number of memory cells
More realistic, based on how memory is accessed
Running Time
Running Time for algorithm
Time between when the first processor starts executing and
when the last processor terminates.
Based on worst-case, unless stated otherwise
Speedup
A measure of the decrease in running time due
to parallelism
Let t1 denote the running time of the fastest
(possible/known) sequential algorithm for the
problem.
If the parallel algorithm uses p PEs, let tp denote
its running time.
The speedup of the parallel algorithm using p
processors is defined by S(1,p) = t1/tp.
The goal is to obtain the largest speedup possible.
For worst (average) case speedup, both t1 and tp
should be worst (average) case, respectively.
Tendency to ignore these details for asynchronous
computing involving multiple concurrent tasks.
Optimal Speedup
Theorem: The maximum possible speedup for parallel
computers with n PEs for traditional problems is n.
Proof (for traditional problems):
Assume a computation is partitioned perfectly into n
processes of equal duration.
Assume no overhead is incurred as a result of this
partitioning of the computation (e.g., the partitioning
process, information passing, coordination of
processes, etc.).
Under these ideal conditions, the parallel computation
will execute n times faster than the sequential
computation.
If ts denotes the sequential running time, the parallel
running time is ts/n.
Then the parallel speedup of this computation is
S(n,1) = ts/(ts/n) = n
Linear Speedup
Preceding slide argues that
S(n,1) ≤ n
and optimally S(n,1) = n.
This speedup is called linear since S(n,1) = Θ(n)
The next slide formalizes this statement and
provides an alternate proof.
Speedup Example
Example: Compute the sum of n numbers using
a binary tree with n leaves.
Takes tp = lg n steps to compute the sum in
the root node.
Takes t1 = n-1 steps to compute the sum
sequentially.
The number of processors is p = 2n-1, so for
n > 1,
S(1,p) ≈ n/(lg n) < n < 2n-1 = p
Solution is not optimal.
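A sketch of the pairwise, tree-style summation (an illustration, not from the slides): each round halves the number of values, so about lg n parallel rounds suffice, versus n-1 sequential additions.

```python
import math

def tree_sum(values):
    """Simulate the binary-tree sum; return (total, number of parallel rounds)."""
    rounds = 0
    while len(values) > 1:
        # All pairs are added "in parallel" within one round.
        reduced = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:                    # a leftover value waits for the next round
            reduced.append(values[-1])
        values = reduced
        rounds += 1
    return values[0], rounds

n = 1024
total, tp = tree_sum(list(range(n)))
t1 = n - 1                                     # sequential additions
print(total == sum(range(n)), tp == math.ceil(math.log2(n)))   # True True (10 rounds)
print("speedup =", t1 / tp, "with p =", 2 * n - 1, "PEs")      # ~102.3 with 2047 PEs
```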
Superlinear Speedup
Superlinear speedup occurs when S(n) > n
Most texts besides Akl's and Quinn's argue that
linear speedup is the maximum speedup obtainable.
The preceding optimal speedup proof is used to
argue that superlinearity is impossible since linear
speedup is optimal.
Superlinearity (cont)
Some problems cannot be solved without
use of parallel computation.
It seems reasonable to consider parallel
solutions to these to be superlinear.
Examples include many large software
systems
Data too large to fit in the memory of a sequential
computer
Problems contain deadlines that must be met, but
the required computation is too great for a
sequential computer to meet the deadlines.
A Superlinearity Example
Example: Tracking
6,000 streams of radar data arrive during
each 0.5-second period and each corresponds to an
aircraft.
Each must be checked against the projected
position box for each of 4,000 tracked
aircraft
The position of each plane that is matched is
updated to correspond with its radar location.
This process must be repeated every 0.5
seconds.
A sequential processor cannot process data
this fast.
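Rough arithmetic for the workload just described (an illustrative back-of-the-envelope estimate; the slides give only the stream and track counts):

```python
streams = 6_000      # radar returns arriving per 0.5-second window
tracks = 4_000       # aircraft currently being tracked
window = 0.5         # seconds

checks = streams * tracks                         # naive all-pairs position-box checks
print(checks, "checks per window")                # 24000000
print(int(checks / window), "checks per second")  # 48000000
# Even at just a few operations per check, matching alone demands hundreds of
# millions of operations every second, repeated indefinitely in real time.
```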
Counterexample to Speedup
Folklore Theorem
A sequential solution to the previous example
required n² steps.
The parallel solution took lg n steps
using a binary tree with p =
2n-1 = Θ(n) processors.
The speedup is S(p,1) = t1/tp = n²/lg n,
which is asymptotically larger than p, the
number of processors.
Akl calls this speedup parallel synergy.
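A quick numerical check (illustrative, not from the slides) that the speedup n²/lg n outgrows the processor count p = 2n-1:

```python
import math

for n in (16, 256, 4096):
    p = 2 * n - 1
    speedup = n ** 2 / math.log2(n)
    print(f"n={n:5d}  p={p:6d}  speedup={speedup:12.1f}  speedup/p={speedup / p:6.1f}")
# The ratio speedup/p grows without bound, so the speedup exceeds any linear
# bound in the number of processors.
```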
Counterexample to Slowdown
Folklore Theorem
A parallel solution to the previous example with q = 2√n-1 =
Θ(√n) PEs had tq = (n-1)√n + Θ(lg n).
The parallel solution using a binary tree with
p = 2n-1 = Θ(n) PEs took tp = lg n.
The Slowdown Folklore Theorem states that tp ≤ tq ≤ (1 + p/q) tp
Clearly tq/tp is asymptotically greater than p/q, which
contradicts the Slowdown Folklore Theorem:
(p/q) tp = ((2n-1)/(2√n-1)) (lg n) = Θ(√n lg n)
tq = (n-1)√n + Θ(lg n) = Θ(n^(3/2))
Akl also calls this asymptotic speedup parallel synergy.
Cost
The cost of a parallel algorithm is defined by
Cost = (running time) × (number of PEs)
= tp × n
Cost allows the performance of parallel algorithms to be
compared to that of sequential algorithms.
The cost of a sequential algorithm is its running time.
The advantage that parallel algorithms have in using
multiple processors is removed by multiplying their
running time by the number n of processors used
If a parallel algorithm requires exactly 1/n the running
time of a sequential algorithm, then the parallel cost is
the same as the sequential running time.
Cost Optimal
A parallel algorithm for a problem is cost-optimal if its
cost is proportional to the running time of an optimal
sequential algorithm for the same problem.
By proportional, we mean that
cost = tp × n ≤ k × ts
for some constant k.
Equivalently, a parallel algorithm with cost C is cost
optimal if there is an optimal sequential algorithm with
running time ts and Θ(C) = Θ(ts).
If no optimal sequential algorithm is known, then the cost
of a parallel algorithm is usually compared to the running
time of the fastest known sequential algorithm instead.
Work
Defn: The work of a parallel algorithm is the
sum of the individual steps executed by all the
processors.
While inactive processor time is included in cost, only
active processor time is included in work.
Work indicates the actual steps that a sequential
computer would have to take in order to simulate the
action of a parallel computer.
Observe that the cost is an upper bound for the work.
While this definition of work is fairly standard, a few
authors use our definition of cost to define work.
Algorithm Efficiency
Efficiency is defined by
E(1,n) = S(n)/n
Observe that
E(1,n) = ts/(n × tp) = ts/cost
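A small helper (an illustration, not from the slides) that evaluates speedup, cost, and efficiency for the earlier tree-sum example, where t1 = n-1, tp = lg n, and p = 2n-1:

```python
import math

def measures(t1, tp, p):
    """Return (speedup, cost, efficiency) for a p-processor algorithm."""
    speedup = t1 / tp
    cost = tp * p
    efficiency = t1 / cost            # equivalently speedup / p
    return speedup, cost, efficiency

n = 1024
s, c, e = measures(n - 1, math.log2(n), 2 * n - 1)
print(f"speedup={s:.1f}  cost={c:.0f}  efficiency={e:.3f}")
# speedup=102.3  cost=20470  efficiency=0.050 -- the low efficiency reflects the
# many idle PEs; cost is far larger than t1, so the algorithm is not cost-optimal.
```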
Amdahl's Law
Let f be the fraction of operations in a
computation that must be performed
sequentially, where 0 ≤ f ≤ 1. The
maximum speedup S(n) achievable by a
parallel computer with n processors is
S(n) ≤ 1/(f + (1-f)/n) ≤ 1/f
Note: Amdahl's law holds only for traditional
problems.
Derivation:
S(n) = ts/tp
     ≤ ts/(f ts + (1-f) ts/n)
     = 1/(f + (1-f)/n)
     = n/(nf + (1-f))
     = n/(1 + (n-1)f)
Limits on Parallelism
Note that Amdahl's law places a very strict limit
on parallelism:
Example: If 5% of the computation is serial,
then its maximum speedup is at most 20, no
matter how many processors are used, since
S(n) ≤ 1/f = 1/0.05 = 100/5 = 20
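A short sketch of Amdahl's bound (illustrative, not from the slides), showing how the speedup approaches the cap 1/f = 20 when f = 0.05, no matter how many processors are added:

```python
def amdahl_speedup(f, n):
    """Upper bound on speedup with serial fraction f on n processors."""
    return 1.0 / (f + (1.0 - f) / n)

f = 0.05
for n in (10, 100, 1_000, 10_000):
    print(f"n={n:6d}  S(n) <= {amdahl_speedup(f, n):5.2f}")
# n=    10  S(n) <=  6.90
# n=   100  S(n) <= 16.81
# n=  1000  S(n) <= 19.63
# n= 10000  S(n) <= 19.96
print("limit as n grows:", 1 / f)    # 20.0
```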