Whenever multiple data streams are used, that is, in the MIMD and SIMD systems, another type of classification is needed to describe the organization of memory. Memory organization is classified as either shared or distributed. In a shared-memory system the multiple data streams are stored in memory that is common to all processors. In a distributed-memory system each processor has its own local memory, and data are transferred among processors, as needed by the code, using a communication network. The characteristics of the communication network are important for the design of parallel machines, but they are not essential for the implementation of a parallel algorithm. Communication networks are treated only in passing in this book, but references on the topic are given in Section 1.5.
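To make the distributed-memory model concrete, here is a minimal sketch (an illustration, not from the book; MPI is assumed merely as one concrete message-passing library) in which each processor owns only a local slice of two vectors, and a single collective call moves the partial results over the communication network:

```c
#include <mpi.h>
#include <stdio.h>

/* Distributed dot product: each of the P processors stores a local
   slice of the vectors x and y in its own memory; partial sums travel
   over the communication network via one collective reduction. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { LOCAL_N = 4 };               /* elements stored per processor */
    double x[LOCAL_N], y[LOCAL_N];
    for (int i = 0; i < LOCAL_N; i++) { /* fill the local slices */
        x[i] = 1.0;
        y[i] = (double)(rank * LOCAL_N + i);
    }

    double local = 0.0;                 /* purely local computation */
    for (int i = 0; i < LOCAL_N; i++)
        local += x[i] * y[i];

    double global = 0.0;                /* one message-passing step */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot(x, y) = %g\n", global);

    MPI_Finalize();
    return 0;
}
```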
Multiprocessor systems are also characterized by the number of available processors. Small-scale parallel systems have up to 16 processors, medium-scale systems up to 128, and large-scale systems up to 1024. Systems with more than 1024 processors are considered massively parallel. Finally, multiprocessor systems are also characterized as coarse-grain or fine-grain. In coarse-grain systems each processor is very powerful, typically of the kind found in contemporary workstations, with several megabytes of memory. Fine-grain systems typically use simple processing elements with a few kilobytes of local memory each. For example, an Intel iPSC/860 system with 1024 processors is a coarse-grain parallel machine. The Connection Machine CM-2 with up to 64K (here 1K = 1024) simple processing elements is a fine-grain, massively parallel machine.
A type of computer that deserves special classification is the vector computer, which has special vector registers for storing arrays and special vector instructions that operate on vectors. While vector computers are a simple, special case of SIMD machines, they are usually considered a different class because vector registers and vector instructions appear in the processors of many parallel MIMD systems as well. Also, the development of algorithms or software for a vector computer (as, for example, the CRAY Y-MP/4 or the IBM 3090-600/VF) poses different problems than
the design of algorithms for a system with multiple processors that operate
synchronously on multiple data (as, for example, the Connection Machine
CM-2 or the DAP).
Control Parallelism
In control parallelism the instructions of a program are partitioned into sets of instructions that are independent of each other and can therefore be executed concurrently. A well-known type of control parallelism is pipelining. Consider, for example, the simple operation, commonly called SAXPY,
\[
  Y \leftarrow aX + Y,
\]
where $a$ is a scalar and $X$ and $Y$ are vectors.
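To see what pipelining buys, consider a standard back-of-the-envelope timing argument (illustrative; the stage count $k$ and the unit stage time $\tau$ are assumptions, not quantities from the text):

```latex
% Illustrative pipeline timing: the multiply-add is split into k stages
% of duration tau each and applied to vectors of length n.  Without
% pipelining, each element passes through all k stages before the next
% one starts; with pipelining, once the pipe is full one result emerges
% every cycle.
\[
  T_{\text{serial}} = k n \tau, \qquad
  T_{\text{pipe}} = (k + n - 1)\,\tau, \qquad
  \frac{T_{\text{serial}}}{T_{\text{pipe}}} = \frac{k n}{k + n - 1}
  \;\longrightarrow\; k \quad (n \to \infty).
\]
```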
Data Parallelism
In data parallelism, instructions are performed on many data elements simultaneously by many processors. For example, with $n$ processors the SAXPY operation can be completed in two steps: one step executes the multiplications $aX$ componentwise in parallel, and a second step executes the additions.
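A minimal sketch of this data-parallel SAXPY in C (illustrative, not from the book; the OpenMP pragma is assumed as one concrete way to request the parallel loop, and it is ignored harmlessly by compilers built without OpenMP support, e.g., without -fopenmp):

```c
#include <stdio.h>

/* SAXPY, y <- a*x + y: every element is updated independently of the
   others, so all n multiply-add pairs may execute simultaneously on
   n processors; without the pragma the loop runs serially with the
   same result. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {1.0f, 1.0f, 1.0f, 1.0f};

    saxpy(4, 2.0f, x, y);            /* y becomes {3, 5, 7, 9} */

    for (int i = 0; i < 4; i++)
        printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```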
be a problem for the programmer and not for the mathematician who
is designing the algorithm. However, since parallel computers come
in a variety of architectures, this problem cannot be left only to the
programmer. At the one extreme of parallel architectures we have
We see that both the structure of the mathematical algorithm and the structure of the underlying application play a crucial role in parallel computations. It may be possible to decompose the problem into domains that can be solved independently. For example, in image reconstruction from projections we may consider the decomposition of a discretized cross section of the image into subsections. If the subsections are properly chosen, the image could be reconstructed from these smaller sections.
This is the background against which this book is developed. We argue that it is not enough just to know what a parallel computer is. And for a mathematician, it is not enough to “develop parallel algorithms.” In order to attack a significant real-world problem successfully, we have to study carefully the modeling process and the appropriate parallel computing architecture, and develop algorithms that are mathematically sound and suitable for both.
We stress, however, that the design of an algorithm should not be linked to the specifics of a parallel computer architecture. Such an approach would render the algorithm obsolete once the architecture changes. Given also the diversity of existing parallel architectures, this approach would find limited applicability. A basic premise of this book is that algorithms should be flexible enough to facilitate their implementation on a range of architectures. Indeed, we present algorithms for several optimization problems
\[
  f_i(x) \le 0, \qquad i = 1, 2, \ldots, m, \tag{1.1}
\]
Here $S$ and the $R_i$ are algorithmic operators, the $R_i$'s being all of the row-action type.

Sequential Block-Iterative Algorithms. Here the system (1.1) is decomposed into fixed blocks according to (1.7), and a control sequence $\{t(\nu)\}_{\nu=0}^{\infty}$ over the set $\{1, 2, \ldots, M\}$ is defined. The algorithm performs sequentially, according to the control sequence, block iterations of the form
\[
  x^{\nu+1} = B_{t(\nu)}(x^{\nu}), \tag{1.8}
\]
where $t(\nu)$ is the control index, $1 \le t(\nu) \le M$, specifying the block which is used when the algorithmic operator $B_{t(\nu)}$ generates $x^{\nu+1}$ from $x^{\nu}$; a code sketch following the next classification item illustrates this and the simultaneous variant.
Simultaneous Block-Iterative Algorithms. In this class, block iterations are first performed on all blocks simultaneously, using the same current iterate $x^{\nu}$,
\[
  x^{\nu+1,t} = B_{t}(x^{\nu}), \qquad t = 1, 2, \ldots, M.
\]
The next iterate $x^{\nu+1}$ is then generated from the intermediate ones $x^{\nu+1,t}$ by
\[
  x^{\nu+1} = S\bigl(\{x^{\nu+1,t}\}_{t=1}^{M}\bigr). \tag{1.12}
\]
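To fix ideas, here is an illustrative C sketch of both block-iterative variants (the setup is entirely assumed: the blocks are the two halves of a four-vector, each operator $B_t$ merely damps its block toward zero, and $S$, which the text leaves abstract, is taken to be the average of the intermediates, a common concrete choice):

```c
#include <stdio.h>

#define N 4          /* dimension of the iterate x */
#define M 2          /* number of blocks (halves of x) */

/* Toy algorithmic operator B_t: damp the entries of block t toward
   zero and copy the rest of x unchanged.  Any block-wise update rule
   could stand in its place. */
static void apply_B(int t, const double *x, double *out)
{
    for (int i = 0; i < N; i++)
        out[i] = x[i];
    for (int i = t * (N / M); i < (t + 1) * (N / M); i++)
        out[i] = 0.5 * x[i];
}

int main(void)
{
    double x[N] = {8.0, 8.0, 8.0, 8.0};

    /* Sequential block iteration, eq. (1.8): a cyclic control
       sequence t(nu) = nu mod M picks one block per iteration. */
    for (int nu = 0; nu < 4; nu++) {
        double next[N];
        apply_B(nu % M, x, next);
        for (int i = 0; i < N; i++)
            x[i] = next[i];
    }
    printf("sequential:   %g %g %g %g\n", x[0], x[1], x[2], x[3]);

    /* Simultaneous block iteration, eq. (1.12): all blocks update
       from the same current iterate (this loop over t is what the M
       processors would execute concurrently); S averages the
       intermediates. */
    double y[N] = {8.0, 8.0, 8.0, 8.0};
    double inter[M][N];
    for (int t = 0; t < M; t++)
        apply_B(t, y, inter[t]);
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int t = 0; t < M; t++)
            y[i] += inter[t][i] / M;     /* S = average */
    }
    printf("simultaneous: %g %g %g %g\n", y[0], y[1], y[2], y[3]);
    return 0;
}
```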
ing problem. The blocks can be chosen so as to ensure that all processors
receive tasks of the same computational difficulty. Hence, all processors
complete their work in roughly the same amount of time. Delays while
processors wait for each other are minimized. The specification of block
\[
  x = \bigl((x^{1})^{T}, (x^{2})^{T}, \ldots, (x^{n})^{T}\bigr)^{T},
\]
\[
  A = \begin{pmatrix}
        W_{1} &        &        &        \\
        T_{2} & W_{2}  &        &        \\
              & \ddots & \ddots &        \\
              &        & T_{n}  & W_{n}
      \end{pmatrix}. \tag{1.16}
\]
Speedup is defined as the ratio of the solution time with the best serial algorithm, executed on the fastest available serial computer, to the solution time with the parallel algorithm executed on multiple processors of the parallel computer.
scales up, and linear speedup can be expected. Consider a problem with a fraction s of serial execution time and a parallelizable part that consumes a fraction p of the execution time. When a P-processor system becomes available, the application is scaled up accordingly. If the parallel part scales with the number of processors while the serial part is unchanged, the scaled speedup is s + pP (with s + p = 1).
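As an illustrative computation with assumed numbers (not from the text), take $s = 0.05$, $p = 0.95$, and $P = 100$; the scaled speedup is nearly linear, whereas the fixed-size (Amdahl) bound for the same fractions is far smaller:

```latex
% Scaled speedup with assumed s = 0.05, p = 0.95, P = 100:
\[
  S_{\text{scaled}} = s + pP = 0.05 + 0.95 \times 100 = 95.05,
\]
% versus the fixed-size bound, where only the parallel fraction shrinks:
\[
  S_{\text{fixed}} = \frac{1}{s + p/P}
                   = \frac{1}{0.05 + 0.95/100} \approx 16.8 .
\]
```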
and Meyer (1988, 1991, 1994), and Rosen (1990). Pardalos, Phillips, and Rosen (1992) treat some aspects of mathematical programming with respect to their potential for parallel implementations. Extensive lists of references on parallel computing for optimization are collected in Schna