
Statistics and Computing (1996) 6, 37-49

A review of parallel processing for

statistical computation

N. M. ADAMS, S. P. J. KIRBY, P. HARRIS and D. B. CLEGG

Department of Statistics, Faculty of Mathematics and Computing, The Open University,

Walton Hall, Milton Keynes, MK7 6AA, UK

Received September 1994 and accepted July 1995

Parallel computers differ from conventional serial computers in that they can, in a variety of ways, perform more than one operation at a time. Parallel processing, the application of parallel computers, has been successfully utilized in many fields of science and technology. The purpose of this paper is to review efforts to use parallel processing for statistical computing. We present some technical background, followed by a review of the literature that relates parallel computing to statistics. The review material focuses explicitly on statistical methods and applications, rather than on conventional mathematical techniques. Thus, most of the review material is drawn from statistics publications. We conclude by discussing the nature of the review material and considering some possibilities for the future.

Keywords: Parallel processing, statistical computing, Flynn's taxonomy, parallel software tools, parallel performance

1. Introduction

We present a review of the literature pertaining to the explicit application of parallel processing to statistics. Literature of this type is scarce, despite many authors (including Sylwestrowicz, 1982, and Havránek and Stratkoš, 1989) commenting on the utility of parallel processing for statistics. The review was conducted in order to locate the available literature in this field and to isolate potential research areas. The elements of statistical computation that we consider are limited explicitly to numerical statistical algorithms and statistical applications. Aspects such as linear algebra, optimization and quadrature in the mathematical formulation are only briefly considered. Also outside the scope of the review are such areas as parallel computation for graphics and symbolic computation.

Parallel processing, that is, the use of parallel computers, is a vast field. Quinn (1987) suggested that the 1990s would be the decade of the parallel computer. Certainly, parallel computers provide computational power to solve otherwise intractable problems (Kaufmann and Smarr, 1993). In Section 2 we give a brief introduction to parallel processing, including outline descriptions of hardware and software development tools.

Many different designs of parallel computer exist. The computational properties of many of these designs are well understood. Most of the basic tools of statistical computing, including linear algebra, sorting and random number generation, have been implemented on parallel computers, though not necessarily explicitly for the purpose of statistical computing. Section 3 provides a review of the available literature relating parallel processing to statistics, together with references to the parallelization of major numerical methods. Some details of the literature search methods employed in this review are also briefly described. Finally, Section 4 contains some closing comments about the nature of the review material and the likely prospects for parallel processing in statistics.

2. Parallel processing

Parallelism is the process of performing tasks concurrently, that is, more than one task per unit time. A parallel computer is a computer that has the ability to exploit parallelism incorporated in its architecture. This architecture usually consists of a collection of processing units coupled by an interconnection network.

Eddy (1986) and Eddy and Schervish (1991) have made some effort towards introducing statisticians to parallel computing by presenting tutorial-style introductory papers.




Ostrouchov (1987) describes computers with hypercube architecture and gives some applications. Lewis and El-Rewini (1992) gives an excellent general introduction to parallel computing. For a more detailed description of hardware see either Hwang (1993) or Hockney and Jesshope (1988). Detailed texts on parallel computation include Freeman and Philips (1992) and Bertsekas and Tsitsiklis (1989). Dongarra et al. (1991) is devoted to a discussion of linear algebra on shared memory parallel computers and includes detailed performance measures for a variety of machines as well as an extensive glossary. Wilson (1993) gives a useful glossary of parallel computing terminology.

A variety of parallel computers exist with radically varying architecture. Attempts to classify the different designs have not been completely successful (Hockney and Jesshope, 1988) and no universally accepted classification scheme exists. In this paper we use Flynn's taxonomy (Flynn, 1972), which classifies computers according to how the machine relates its instructions to the data being processed. We choose Flynn's classification because it is adequate for describing parallel algorithms as well as hardware, and it is widely used (Freeman and Philips, 1992). Within Flynn's taxonomy, a stream is a sequence of items (instructions or data) executed or operated on by a processor. There are four broad categories, dealing with single and multiple streams of items:

• SISD: Single Instruction Single Data stream. A conventional sequential computer. Each instruction initiates an operation.
• SIMD: Single Instruction Multiple Data stream. A computer that has a single stream of instructions that initiate operations on many streams of data.
• MISD: Multiple Instruction Single Data stream. This category illustrates the problems inherent in interpreting Flynn's taxonomy. Hwang (1993) puts systolic arrays in this class, while other authors assert that no computers fall in this category.
• MIMD: Multiple Instruction Multiple Data stream. A computer with several processing units capable of operating on several data streams. This includes all forms of multiprocessor and is the most general form of parallelism.

This classification is reasonably well-defined, although many modern parallel systems are hybrids of SIMD and MIMD and hence belong not to either class but to both. We will describe the SIMD and MIMD models in more detail.

2.1. SIMD

Fig. 1. Diagrammatic representation of an SIMD machine. C, control unit; P, processors; S, store (memory)

The SIMD model is represented pictorially by Fig. 1, whereby processors are independent but can access all available memory. A single control unit drives the processing units. SIMD computers typically execute a single stream of instructions with a number of simple processing units, each performing the same instructions on its own data. At a given time, the same instruction is being executed on a collection of processors with each processor manipulating different data. All computers in this class therefore have synchronous operation, that is, access to shared memory is tightly coordinated. This is the most useful model for massively parallel (incorporating many processors) scientific computing, with many engineering and scientific tasks falling naturally in this class including image processing, particle simulation and finite element methods (Lewis and El-Rewini, 1992).

The SIMD model includes: array processors, pipelined vector processors and systolic arrays.

• Array processors consist of a set of elementary processing units connected by a grid, which is usually square. These machines are well suited to computations involving matrix manipulations. Examples include the AMT Distributed Array Processor and Thinking Machines' Connection Machine. An attached-array is simply a conventional computer with array hardware connected.
• Pipelined vector processors achieve parallelism by two methods. First, arithmetic operations (addition, multiplication, etc.) are broken down into individual elements and computed as a pipeline. Processing data as a pipeline can be pictured as performing lower level operations as an assembly line. Second, vector processing units, as the name suggests, allow arithmetic operations between pairs of vector elements fed from vector registers to functional units that use pipelining techniques. The Cray-1 and CDC Cyber 205 are examples of pipelined vector processors. Some more recent systems, such as the Cray-2 and Convex C3800, consist of a set of vector processors.
• Systolic arrays are a highly specialized SIMD design that typically consist of a tightly coordinated square array of processors operating on data accessible from memory around the perimeter of the array.

Fig. 2. Diagrammatic representation of an MIMD machine. C, control unit; P, processor; S, shared or local store (memory)

2.2. MIMD

Multiprocessor systems form the bulk of this class. Figure 2 is a graphical model of MIMD computers. Here, each processor drives its own control unit and can have its own memory, although memory could alternatively be shared.

A multiprocessor system consists of a number of powerful separate processing elements that are in some way connected. An individual processor may itself be a scalar or vector processor, or even a multiprocessor. Multiprocessor systems can be classified into two groups: shared memory and local memory systems, although this distinction is blurred in some modern systems.

• Shared memory systems incorporate a set of processing elements that, by a variety of possible methods of interconnection, share access to a common memory area. Examples of shared memory multiprocessor systems include the Cray-2, the Encore Multimax and the Alliant FX/80.
• Local or distributed memory systems incorporate multiple processing elements each with its own private memory. Here processors communicate directly, and global access to local memory is generally not allowed. Examples of local memory systems include the Intel iPSC/1, Ametek S14 Hypercube and Transputer systems. Whichever kind of system is being used, processor coordination must at some time occur. The two methods of coordination are synchronous and asynchronous. Shared memory systems have synchronous operation since access to globally shared memory is coordinated to prevent contention. The processors in a local memory system have asynchronous operation and coordination is achieved by message-passing. The type of coordination is an important consideration in software and algorithm design.

The term topology refers to the manner in which processors in MIMD and certain SIMD systems are linked. Some systems have a single topology enforced by their construction. Others, most notably transputer systems, allow the reconfiguration of communications links. Various topologies exist including linear chains, rings, meshes and hypercubes; see Fig. 3 for some examples. Each topology has unique computation and communication properties. For more details of parallel computer topology see Bertsekas and Tsitsiklis (1989).

Fig. 3. Some conceptual diagrams of simple parallel topologies. (A) a linear chain of p processors; (B) a complete ring of p processors; (C) a square mesh of 16 processors

Distributed computing systems are an important member of the MIMD category. These parallel systems are constructed by linking independent computers such as workstations together via a network and using the resulting distributed system for a single task. Such systems are becoming more popular, and have been used for some statistical applications, such as in the analysis of hierarchical models and discrete finite inference (Schervish, 1988; Eddy and Schervish, 1986) and exact least median of squares regression (Hawkins et al., 1994). Distributed systems are often constructed from different types of computer, in which case they are termed heterogeneous.

2.3. Software development

There is a broad range of tools for software development on parallel computing systems.



Unfortunately, there is currently no consistent standard for parallel languages or software development tools. Each class of computer requires tools that reflect the architecture. Additionally, each manufacturer's machine tends to have a bespoke software environment that seriously compromises the portability of developed code.

Broadly speaking, parallel languages can be divided into languages designed to accommodate parallelism and extensions to standard sequential languages. In the former group, the most notable examples are occam and Ada. Parallel extensions exist to certain conventional languages (Perrott, 1987) including Fortran and Pascal.

Tools to assist in program development (Freeman and Philips, 1992) include parallelizing compilers, parallel programming environments and parallel debugging aids.

• A parallelizing compiler is capable of receiving source code in a sequential language, typically Fortran, identifying areas of potential parallelism and producing parallel executable code. This tool presents the opportunity to users to have their sequential algorithms implemented on a parallel machine without the laborious task of rewriting them in parallel. However, parallelizing compilers do not guarantee good performance and some rewriting is often required for improved results. Examples of parallelizing compilers include the EPF (Encore Parallel Fortran) compiler (Encore, 1988) available on the Encore Multimax and the CFT (Cray Fortran Translator) compiler (Cray-1 Computer Systems, 1981) for the Cray-1. Gonzalez et al. (1988) reports on tools for an Intel iPSC/2 Hypercube that generate parallel Fortran from the sequential form.
• Parallel programming environments are a coordinated set of tools designed to ease the burden of coding by automating some of the steps in developing parallel programs. Existing programming environments only encompass part of the programming life cycle and the development of such environments is in its infancy (Lewis and El-Rewini, 1992). Two examples include SCHEDULE (Dongarra and Sorenson, 1987) and the Transputer Development System (Inmos, 1990). For a detailed description of parallel environments see Lewis and El-Rewini, or Dongarra and Tourancheau (1992). Environments designed to allow portable and network parallel programming include PVM (Parallel Virtual Machine) and LINDA (Carriero and Gelernter, 1989). PVM (Geist et al., 1993, 1994) is a widely used and freely available tool for constructing parallel distributed systems.
• Parallel debugging aids are essential since methods of debugging a sequential program are inadequate for parallel code. In addition to the usual debugging facilities, a parallel debugger must provide tools for determining processor synchronization points and processor utilization.

A major asset for statistical programming is the availability of subroutine libraries, such as the NAG (Numerical Algorithms Group, UK) or IMSL (International Mathematical and Statistical Libraries, Houston, Texas, USA) libraries. The NAG library is implemented on certain vector supercomputers (Du Croz, 1990). The routines are vectorized to suit the architecture of the specific machine. The implementation of NAG routines on supercomputers is based on the implementation of the BLAS (Basic Linear Algebra Subprograms); see Freeman and Philips (1992). Brophy et al. (1989) report that vectorized versions of the IMSL libraries are also based on the BLAS. Durst (1987) gives a detailed discussion of the use of library software for supercomputing.

A vectorized version of the SAS statistics package is available on Convex supercomputers at sites including the University of London Computing Centre. Ihnen (1989) describes the vectorization of the SAS software. Vectorized SAS utilizes BLAS routines in a similar manner to the NAG and IMSL libraries. We know of no other major statistics package that has been highly optimized for a supercomputer.

2.4. Assessing performance

Many benchmarks exist for comparing the actual speed of parallel computers for specific problems and applications (Hockney and Jesshope, 1988; Dongarra et al., 1991). To the best of our knowledge there are no benchmarks of an explicitly statistical nature, although these would be highly desirable for comparison of high performance computers in this area. Of equal interest here are methods of characterizing and measuring program or algorithm performance. Two terms of particular importance are grain size and speed-up, although interpreting quoted performance is often difficult, as wryly observed by Bailey (1991).

2.4.1. Grain size

Grain size is a qualitative measure of the inherent parallelism of an algorithm. Grain size refers to the number of instructions performed in parallel before some kind of processor synchronization must occur. The parallelism in a given algorithm may be fine-grain, medium-grain or coarse-grain, with fine-grain representing many synchronization points and coarse-grain representing very few. Broadly speaking, simple scalar-vector operations are fine-grain, vector-matrix operations are medium-grain and matrix-matrix operations are coarse-grain. Freeman and Philips (1992) note that SIMD machines are generally suited to fine- and medium-grain parallelism and MIMD machines are more suitable for coarse-grain parallelism.

2.4.2. Speed-up

Researchers of parallel algorithms often quote the speed-up ratio for an algorithm. Speed-up is merely a measure of an algorithm's utilization of many processors. Let T_p be the execution time on p processors. The speed-up ratio S_p on p processors is given by

S_p = T_1 / T_p.

This quantity measures the speed-up obtained by parallelizing a particular algorithm. The fastest sequential and parallel algorithms are generally used for such comparisons. Ideally S_p should increase linearly with p, with gradient 1. This is seldom the case since at some stage processors must communicate. Stewart (1986) gives a detailed example of communication within a parallel algorithm. The efficiency is usually taken as S_p over p and is a measure of processor utilization.

Useful tools for parallel algorithm designers are speed-up or timing models (Freeman and Philips, 1992). Using knowledge of a parallel computer's performance, it is possible to construct a speed-up model that provides an estimate of attainable speed-up for a particular algorithm. Such models can allow the developer to assess the performance of a program without writing any code. Al-Jumeily et al. (1994) give examples of speed-up models for computing the arithmetic mean.
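As a small illustration of these two measures, the following fragment (our own sketch, written in Python rather than the Fortran of the work reviewed, with execution times invented purely for illustration) computes S_p and the corresponding efficiency S_p/p from a table of hypothetical timings.

```python
# Illustrative sketch only: speed-up S_p = T_1 / T_p and efficiency E_p = S_p / p,
# computed from hypothetical execution times (in seconds) on p processors.
timings = {1: 120.0, 2: 63.0, 4: 34.0, 8: 20.0, 16: 14.0}

t1 = timings[1]
for p in sorted(timings):
    s_p = t1 / timings[p]   # speed-up relative to the single-processor time
    e_p = s_p / p           # efficiency: values near 1 indicate good utilization
    print(f"p = {p:2d}   S_p = {s_p:5.2f}   E_p = {e_p:4.2f}")
```

With these invented timings the efficiency falls steadily as processors are added, the pattern that typically appears once communication costs begin to dominate.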

2.5. Amdahl's law

Amdahl's law addresses the problem that a parallel algorithm is likely to have inherently sequential portions. It gives an upper bound to obtainable speed-up. If r is the proportion of a program that may be parallelized and s = 1 - r is the remaining inherently sequential proportion, then for a system with p processors the speed-up S_p satisfies

S_p <= 1 / (s + r/p).

Two interpretations exist for Amdahl's law (Freeman and Philips, 1992). First, it can be seen as imposing an upper bound on the speed-up ratio, with the consequence that adding processors above a certain number will not result in improved performance. Second, Amdahl's law suggests that a larger parallel machine should be used to solve ever-larger problems.
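To make the bound concrete, consider a small worked example with numbers chosen purely for illustration (they are not taken from any of the papers reviewed). Suppose a proportion r = 0.95 of a program can be parallelized, so that s = 0.05. Then

```latex
S_{16} \le \frac{1}{0.05 + 0.95/16} \approx 9.1,
\qquad
\lim_{p \to \infty} S_p \le \frac{1}{s} = 20,
```

so sixteen processors can give at most a nine-fold speed-up and, however many processors are added, such a program can never run more than twenty times faster than its sequential version.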
2.6. Exploiting parallelism in statistical algorithms

To achieve efficient performance on a parallel computer, an algorithm must be designed to suit the particular architecture of the computer. It is in the area of algorithm design that the distinction between SIMD and MIMD is most acute. In general, designing efficient algorithms for a SIMD computer is not as difficult as designing efficient algorithms for a MIMD computer. Many SIMD machines have very effective parallelizing compilers, although to achieve optimal performance some extra programming is usually required, often in the form of ad hoc compiler directives. For MIMD machines the algorithm designer has to decide how to decompose the problem subject to a number of potentially contending issues. These include keeping all processors as busy as possible, preventing communication and memory bottlenecks, and making the algorithm as fast as possible.

A number of models for parallel algorithm design have evolved, based on how the problem is decomposed. Of interest here is the SPMD (Single Program Multiple Data stream) model that describes a situation where a single program drives a number of independent processors. The SPMD model exploits geometric parallelism, whereby the computational requirements of an algorithm are split into similar independent tasks and each task is assigned to a processor. The final answer is formed by collecting the intermediate results from each processor. Different strategies exist for exploiting geometric parallelism, of which the most relevant to the review material is the master-slave mode. This involves having one processor (the master) responsible for driving the remaining processors (the slaves), distributing the data and gathering the results. The key element of this approach is that the computational effort is split among all processors in the same manner. Many algorithms require some form of load balancing. This is an attempt to distribute evenly the amount of work among available processors.
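A minimal sketch of this master-slave, data-partitioning style of decomposition may help to fix ideas. The example below is our own illustration, written in Python with the standard multiprocessing module rather than the Fortran and dedicated hardware of the work reviewed; it computes an arithmetic mean, in the spirit of the speed-up models of Al-Jumeily et al. (1994). The master splits the data into roughly equal chunks, each slave returns a partial result, and the master gathers and combines the intermediate results.

```python
# Master-slave (SPMD) sketch, illustrative only: partition the data, let each
# worker compute a partial sum and count, then gather and combine on the master.
from multiprocessing import Pool
import random

def partial_sum(chunk):
    # Work performed independently on each processor: a simple summation.
    return sum(chunk), len(chunk)

def parallel_mean(data, n_workers=4):
    # Master: split the data into roughly equal, independent chunks.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:          # the slaves
        partials = pool.map(partial_sum, chunks)
    # Master: gather the intermediate results and form the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

if __name__ == "__main__":
    data = [random.gauss(0.0, 1.0) for _ in range(100000)]
    print(parallel_mean(data))
```

The same pattern, with a different computation applied to each chunk, underlies many of the MIMD applications reviewed in Section 3.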

Lafaye de Micheaux (1984) considers construction of parallel MIMD algorithms for statistical data analysis, observing that many statistical techniques can be effectively parallelized on MIMD machines using the same SPMD model. Methods of partitioning data for simple statistical computations are often appropriate for highly complex situations such as projection and clustering. Lafaye de Micheaux concludes that one way to make high-speed processing available for statistical analysis is based on multiprocessor systems constructed from cheap components, using an SPMD approach with data partitioning.

3. Parallel processing for statistics: a review

Exploiting parallelism for statistics is not a new idea, nor is it entirely ignored in major statistical texts. As early as 1977, Chambers (1977) mentioned the use of parallel computing in data analysis situations where the magnitude of computation makes interactive analysis impossible. Thisted (1988) briefly mentioned parallel computers as being an ideal method of implementing Jacobi methods for extracting eigenvalues. Additionally, in his foreword, Thisted commented that his second volume would contain statistical algorithms for parallel computers.

An excellent source of statistics-related parallel processing literature is the Computing Science and Statistics proceedings from the annual Symposium on the Interface between computing science and statistics. Organized annually by the Interface Foundation of North America and others, these proceedings regularly carry a variety of papers on parallel processing and statistics. Material published here relates to both general parallel processing and the application of parallel processing to statistics.

This review is principally based on an initial search of two on-line databases, MATHSCI and BIDS (Bath Information and Data Service, an on-line version of the ISI citation index). Both of these databases provide access to journals and proceedings. We supplemented computer searches with conventional manual search techniques.

It is rather difficult to describe the field of parallel statistical computation systematically. This difficulty stems from the fact that many statistical algorithms are constructed from 'building-block' numerical algorithms. There is an immense amount of literature relating to numerical methods on parallel machines but we wish to focus on areas that are either explicitly statistical or numerical methods investigated in the name of statistics. The review therefore proceeds in the following manner. In each section we consider statistical computing loosely classified in the general manner of the chapter headings of Thisted (1988). Subsections are organized according to our own classification based on the availability of material. The review material is divided into two distinct classes: parallelization of conventional numerical methods, and parallelization of methods specifically in the name of statistics. In the former group, we present only outline details because of the amount of such material. The latter group, explicitly statistical in nature, is given more attention. For all of the review material we focus attention on the reported benefits and drawbacks of a parallel processing approach.

A problem with methods developed for particular parallel computers is a lack of portability. Thus, published parallel algorithms are unlikely to be readily ported to machines with different architectures. A current approach is to publish algorithmic 'skeletons'. The following subsections review parallel work relating to:

• basic numerical methods (3.1)
• numerical linear algebra (3.2)
• non-linear statistical methods (3.3)
• numerical integration and density estimation (3.4)
• seminumerical methods (3.5)
• other (3.6).

3.1. Basic numerical methods

Parallel computers provide an ideal platform for applying simple techniques to large data sets. Any statistical technique that allows data to be broken down and processed as independent subsets is amenable to parallelization (Schervish, 1988).

Schervish (1988) considered a variety of parallel applications, including discrete-finite inference, a computer-intensive approach for the analysis of discrete data, where the dominant aspect of the computation is simple summation of large sets of data. These applications allow elements of the computation to be divided and processed independently. The parallel computer investigated, ES86, was a distributed system constructed from a network of VAX computers. Parallel discrete-finite inference programs achieved near-linear speed-up on the ES86 system. Improved accuracy was also reported for the parallel algorithm.

Skvoretz et al. (1992) employed an NCUBE/10 hypercube multiprocessor (MIMD) to assess the application of parallel processing to typical large-scale social science research, using census data for subsample selection and cross-tabulation. They found that a crucial consideration for good parallel performance is keeping the processors busy while disk access occurs, and achieved this by distributing data to the processors in small packets, rather than en masse (the term packet describes the amount of data being transmitted at a given interaction). Good performance was reported. In comparison with performance of the software package SPSS-X on an IBM 3081, the parallel method was always faster for large (> 80 000 cases) data sets.

3.2. Numerical linear algebra

Much material has been published on parallel numerical linear algebra; for example, Ortega et al. (1990) give a bibliography relating to this subject containing 2000 entries! For excellent descriptions of parallel linear algebra see either Freeman and Philips (1992), Golub and Ortega (1993), Dongarra et al. (1991) or Modi (1988). Many numerical linear algebra methods have been considered for a wide variety of parallel machines. These include both iterative and direct methods and approaches for sparse matrices.

Stewart (1988) gave an overview of how statistical linear algebra computations may be mapped onto different parallel architectures, stressing the importance of the architecture and the central role of linear algebra in statistical computing. Of SIMD machines Stewart concluded that although many elementary processors are applied to a problem, their utility depends on algorithms requiring repetitive bursts of computation. Stewart suggested various solutions to problems affecting the performance of distributed memory MIMD machines. These include careful implementation of parallel algorithms to avoid inefficient coding, trying to reduce processor start-up time and running larger jobs to try to maximize the ratio of computation to communication.

3.2.1. Linear regression

Havránek and Stratkoš (1989) extended the work of Stewart into a practical situation, focusing on multiple linear regression, using a host computer/attached-array processor (SIMD).

It is noted that iterative algorithms often give better parallel performance than elimination methods (Stratkoš, 1987). Havránek and Stratkoš consider various parallel approaches to Cholesky factorization. Interestingly, models including the biggest numbers of parameters, applied to the largest test data set, yielded intractable computations, due to restricted memory. Quoted speed-up results were rather promising. The authors showed that their algorithm's performance does not depend critically on the size of data set.

Xu et al. (1989) considered multiple linear regression models on an Intel iPSC/2 using an SPMD type decomposition, for both improved performance and resistance to processor failure. They also considered the importance of packet size during parallel communication. Xu et al. report good speed-up, this being of the order of 14 for 16 processors. Amdahl's law manifested itself here, as 14 is shown to be an asymptotic limit on speed-up.

Kleijnen (1990) employed a Cyber 205 supercomputer (SIMD) for a Monte Carlo experiment to compare the performance of Rao's test for validity of a regression model with a cross-validation approach in multivariate regression. The parallel code performed well for ordinary least squares estimates (OLS), and for general least squares (GLS) estimates based on large samples. However, GLS estimates obtained from small samples actually resulted in speed-down, that is, slower absolute time using parallelism than a single processor.

Kleijnen and Annink (1992) gave a detailed description of the vectorization of a program for Monte Carlo simulation of regression analysis. They found that OLS estimates yielded good parallel performance due to effective vectorization. However, the matrix inversion in their GLS algorithm did not benefit from vectorization and speed-up was diminished relative to the OLS method.

As part of an investigation into parallel processing for social science research (see Section 3.1) Skvoretz et al. (1992) experimented with large regression models. The parallel component of the computation involved computing the covariance matrix in an SPMD manner. Experiments involved computing regression models using different numbers of processors and reading varying amounts of data from disk. They found that the latter consideration was critical in trying to obtain good performance.

3.2.2. Robust regression

Kapenga and McKean (1987) considered vectorized algorithms for the robust R-estimates of Jaeckel (1972). In particular, their purpose was to assess the vectorization of the k-step algorithm of McKean and Hettmansperger (1978). Experiments performed on an FPS-264 array processor (SIMD) are reported as achieving considerable speed-up at the highest levels of parallel optimization. The vectorized algorithm had three major contributors: a QR decomposition, a projection and a sorting stage. The vectorized sorting component gave no speed-up. Kapenga and McKean suggest that using vectorized methods makes robust analysis of this type feasible for most problems of moderate size.

Kaufman et al. (1988) considered the application of parallelism to resampling methods. Illustrative examples based on parallel algorithms for cluster analysis (see Section 3.3.1) and robust regression demonstrated the applicability of parallelism to resampling methods. The computer investigated was the 1CAP 1 IBM research machine (MIMD), consisting of a host processor and 10 connected array processors. Attention focused on the least median of squares in robust regression, with consideration of different parallelization strategies for a sequential program. The chosen method involved an SPMD type approach. Kaufman et al. noted that inter-processor communication is often a critical factor and some form of load balancing is often necessary.

Xu and Shiue (1993) developed three SPMD-type parallel algorithms for least median of squares (LMS) regression (Rousseeuw, 1984), using an Intel iPSC/2 MIMD computer. Parallelization strategies were deduced from consideration of the nested loop structure of sequential implementations. Xu and Shiue made extensive use of speed-up models to assess different formulations. Load-balancing strategies were used to improve algorithms where appropriate. An exact algorithm and an approximate algorithm yielded near linear speed-up, while a parallelized version of a fast sequential algorithm resulted in catastrophic speed-down due to severe load imbalance.

Hawkins et al. (1994) investigated various configurations for the distributed computation of exact LMS regression. Distributed computation (Section 2.2) in this case means partitioning a sequential program onto a group of processors, which may be independent computers, and gathering results on completion. Here, the processors need not be dedicated to a single task. Within Flynn's taxonomy this is a distributed memory MIMD computer. A list of tasks is presented to partition the problem into a distributed solution. The code gave good speed-up (efficiency is quoted), particularly for large sample sizes, on a variety of systems, including a 22-processor heterogeneous system that required extra coding because of the added complexity of different vendors' hardware.

3.2.3. Subset selection

Mitchell and Beauchamp (1986) developed a Bayesian method of subset selection in linear regression. A branch and bound algorithm was implemented on a Cray X-MP supercomputer (SIMD). Computing large regression models with many predictors motivated use of the supercomputer. With 25 predictor variables, reported as the largest test case, the algorithm required one second of cpu time on the Cray. Sequential timings are not reported. Like the work of Mitchell and Morris (1988) (Section 3.5), this application of parallel processing appears to be purely functional, rather than an investigation of parallelism.

Wollan (1988) addressed all-subsets regression on an Intel iPSC hypercube (MIMD), with the purpose of assessing the usefulness of parallel processing to statistics. The most notable feature of the parallel algorithm was the computation of every regression model. The algorithm performed well, achieving near-linear speed-up. However, in comparison with the Furnival-Wilson branch and bound algorithm (Furnival and Wilson, 1974) as implemented in SAS on a sequential machine, the sequential method gave faster results, although collinearity checking was less rigorous.

3.2.4. Smoothing census adjustment factors

Eddy et al. (1992) described an extensive linear algebra calculation performed as part of the US Census Post Enumeration Survey, implemented on a variety of parallel platforms in order to assess the costs and benefits of fast computing environments. The analysis involved using a variance components model to adjust for errors in the census data. Three distinct systems were investigated: a Cray Y-MP (used in SIMD mode), a Connection Machine CM-2 (SIMD) and distributed systems (MIMD) constructed from DEC UNIX workstations running under LINDA (see Section 2.3). Eddy et al. described difficulties porting the original SAS/IML (interactive matrix language) code to parallel versions of Fortran and C on the different computing platforms. Porting to the Cray was a relatively easy task. The Connection Machine port proved more difficult, with programs requiring substantial modification. Various parallel strategies were investigated for the distributed algorithm, constructed under LINDA, none of which resulted in effective speed-up. The Cray yielded the fastest results for the computation.

3.3. Non-linear statistical methods

Many non-linear optimization techniques have been implemented in parallel. Lootsma and Ragsdell (1988), Lootsma (1989) and Schnabel (1988) give detailed surveys of parallel non-linear optimization algorithms. Zenios (1989) provides a detailed bibliography on parallel optimization.

3.3.1. Clustering

Raphalen (1982) considered the use of a SIMD machine for a hierarchical clustering algorithm. Parallelization was confined only to the first stage, computing a distance matrix, based on Euclidean distance. Two algorithms were considered for this task, their application being based on the size of the data set and number of variables. The implementation machine was not described. It was noted that whilst near linear speed-up is theoretically possible, communication overheads induced by the interconnection network can reduce the achieved efficiency significantly.

Kaufman et al. (1988) investigated partitioning methods in cluster analysis on the 1CAP 1 research machine (see Section 3.2.2). A modification of a program called CLARA for the k-medoid method for clustering (Kaufman and Rousseeuw, 1986) is described. A sequential Fortran version of the program is ported to the 1CAP. Two parallelization strategies based on a master-slave approach are described, both of which have minimal communication overhead. Neither strategy heavily utilized the host processor. Good performance is reported.

3.3.2. Robust projection pursuit

de Doncker et al. (1989) reported the use of a Sun-3 network for robust projection pursuit (Huber, 1985). R-estimates (Jaeckel, 1972) were used to make the projection component of the algorithm resistant to outliers. An SPMD data partitioning algorithm was constructed. de Doncker et al. observed that for observation matrices of moderate size, most of the algorithm's processing time is concentrated on computation, with little time expended on inter-processor communication.

3.3.3. Econometric models

Healey and Davies (1983) reported the application of an ICL DAP (Distributed Array Processor) to large-scale Poisson regression models (McCullagh and Nelder, 1983). The parallelization of the serial iterative scaling algorithm was described. The results demonstrated that the parallel algorithm's performance improved relative to the amount of data used by the model.

Hénaff and Norman (1987) described the design and implementation of an algorithm for solving large non-linear econometric models on vector processors. A reduced Newton algorithm is vectorized with particular attention given to sparse matrix operations. The method was restructured so conventional sparse matrix operations were completely avoided. The algorithm was coded in Fortran with calls to the IMSL libraries. Experimental results were reported for both a CYBER-205 and a Cray X/MP. Different properties were observed in detailed analysis of the performance on the two computers. Promising results were reported.

3.3.4. Time series

The subject of time series applications is addressed to some extent in the literature describing the application of parallel processing to control systems.

In an early paper, Fahrmeir (1977) considered using parallel computers for estimating stochastic parameters of time series models. Regression, distributed-lag and ARMA models were considered. A Bayesian adaptive-filtering approach was adopted because it requires little knowledge of the dispersion properties of the error variables. The performance of the parallel algorithm is not quantified although it appears to perform well.

Schervish (1988) briefly described the parallelization of a time series model, described in detail in Schervish and Tsay (1988).

Bayesian models were constructed for autoregressive processes that allow for changes in the model. Their method for dealing with outliers required the estimation of many models. The possible models were divided among the available processors. Speed-up of the order of four for six processors is reported.

3.3.5. Hierarchical models

Schervish (1988) briefly described the application of the ES86 distributed VAX system (see Section 3.1) to the analysis of large hierarchical models for household crime data. The algorithm involved the repeated evaluation of four-dimensional integrals, giving a speed-up of 6 on a system consisting of 11 processors. Schervish notes that improving this speed-up would require extensive reprogramming.

Schervish also reported the use of the ES86 parallel system for a hierarchical model for analysing data from a large prison inmate survey. The model required that nearly 10 000 integrals be evaluated. The parallel algorithm involved splitting the computation into small pieces and distributing the pieces to each VAX processor. A network of 10 VAX stations yielded a speed-up of 8.

3.4. Numerical integration and density estimation

Numerical integration and related topics do not appear to have received the same amount of attention as the areas described above. The subject is introduced in both Golub and Ortega (1993) and Freeman and Philips (1992). Gladwell (1987) considers vectorized forms of certain one-dimensional quadrature codes. de Doncker and Kapenga (1989) describe a parallel multivariate numerical integration algorithm suitable for certain MIMD computers. Some performance results are given for MIMD adaptive quadrature codes in de Doncker and Vakalis (1993). Golub and Ortega observe that for many integrals the power of parallel processing is unnecessary. However, in certain statistical areas where high-dimensional integrals occur, such as Bayesian methods, the increased processing power of parallel computers may be very useful.

Sylwestrowicz (1982) gave examples of Monte Carlo methods in statistics, implemented on an ICL DAP (SIMD). An efficient method of pseudorandom number generation for a SIMD machine was given. A performance model for evaluation of a simple one-dimensional integral using the parallel Monte Carlo approach suggests excellent speed-up. Sylwestrowicz asserts that simulation of this kind with independent calculations (SPMD) is well suited to parallel computers, although algorithms for large problem size may be inappropriate for small problems.

O'Sullivan and Pawitan (1993) briefly described a parallel approach to multidimensional density estimation by tomography. The approach involved filtered backprojection and is readily parallelized. One-dimensional functions were integrated in different directions in parallel. On a Sequent Symmetry multiprocessor machine, linear speed-up was achieved. It is noted that this performance remains intact for higher dimensional functions.

3.5. Seminumerical methods

We use the term 'seminumerical' in a similar sense to Thisted (1988), namely to describe algorithms that consist of a large component of integer processing. This includes areas such as sorting and computing random numbers. A detailed discussion of sorting algorithms for a variety of parallel computers is given in Akl (1985). Computing random numbers in parallel is a large research area and we refer the reader to Anderson (1990) for a review of common methods.

Mitchell and Morris (1988) developed a Bayesian approach to the design and analysis of computational experiments. In this case, a computational experiment uses a computer to model a physical system, with a design to dictate the input to the program. A Bayesian approach was used for predicting the outcome of the experiment, which led to specialized design procedures. Experiments were conducted on a Cray X-MP (SIMD), with each experiment requiring about 45 seconds.

Schork and Hardwick (1990) presented results for parallelizing permutation and randomization tests on an IBM 3090-600E (MIMD) computer. Interest focused on testing the equality of k covariance matrices. Parallelism was exploited by making each processor perform an equal fraction of the desired number of permutations (again an SPMD type of approach). Near linear speed-up was reported.

Many writers, including Stewart (1988), Stine and Woteki (1989) and Kaufman et al. (1988), suggest that the bootstrap (Efron and Tibshirani, 1993) is an excellent candidate for parallel implementation. Xu and Shiue (1991) described two examples of parallelizing bootstrap confidence intervals on an Intel iPSC/2 hypercube (MIMD). The parallel algorithm divides the bootstrap sampling equally among the nodes. Each node computes parameters sequentially. Parallel sort and search methods are then applied across the nodes to generate the required percentiles for the confidence intervals. Speed-up depended on which type of confidence interval was constructed. For both algorithms the dominant computations were sorting and searching. Xu and Shiue discussed a speed-up model that gives an upper bound for expected speed-up.
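The general idea can be sketched as follows (our own simplified illustration in Python, not the authors' code, reduced to a percentile interval for a sample mean): the desired number of bootstrap replicates is divided equally among the workers, each worker resamples with its own random stream, and the pooled replicates are sorted to read off the required percentiles.

```python
# Illustrative sketch only: bootstrap replicates divided equally among workers,
# then pooled and sorted to give a percentile confidence interval for the mean.
from multiprocessing import Pool
import random

def bootstrap_block(args):
    data, n_rep, seed = args
    rng = random.Random(seed)              # independent stream per worker
    n = len(data)
    return [sum(rng.choice(data) for _ in range(n)) / n for _ in range(n_rep)]

def parallel_bootstrap_ci(data, n_rep=2000, n_workers=4, alpha=0.05):
    tasks = [(data, n_rep // n_workers, seed) for seed in range(n_workers)]
    with Pool(n_workers) as pool:
        blocks = pool.map(bootstrap_block, tasks)
    reps = sorted(r for block in blocks for r in block)   # pool and sort
    lo = reps[int((alpha / 2) * len(reps))]
    hi = reps[int((1 - alpha / 2) * len(reps)) - 1]
    return lo, hi

if __name__ == "__main__":
    sample = [random.gauss(10.0, 2.0) for _ in range(200)]
    print(parallel_bootstrap_ci(sample))
```

In the reviewed work the dominant computations were the sorting and searching steps; in a simple sketch like this one they are the part left unparallelized.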

3.6. Other

This section briefly introduces material not suitable for inclusion above. Some of these areas cover the more obscure links between computer science and statistics.

Freisleben (1993) describes parallel neural network algorithms for extracting principal components directly from data.

Bäck and Hoffmeister (1994) discuss the use of parallel processing in their introduction to genetic algorithms. Parallel aspects of Markov chain Monte Carlo methods are discussed in Malfait et al. (1993). The successful use of SIMD hardware for image processing is described by Grenander and Miller (1994).

4. Discussion and conclusion

In this review we have been specific in the choice of material we have presented. The area we have concentrated on is the application of parallel processing to statistical methods. For other areas, not explicitly statistical, we have given indicative references in the hope that interested readers will know where to turn. The review material demonstrates that parallel processing has been investigated for, and applied to, a wide range of statistical methods. While the methods and applications are diverse, a common reason for using parallel machines is processing speed. Many of the applications have been developed in parallel versions of Fortran. The complexity of the developed algorithms ranges from the simple, based on summations, to the highly complex, such as Wollan's algorithm for all subsets regression. Many of the MIMD applications have been parallelized by a similar approach, namely the algorithm is decomposed into relatively independent subsets and each subset assigned to a processor (the SPMD model; see Section 2.6). Performance problems were typically induced by load balancing difficulties and small data sets. Little attention seems to have been paid to the accuracy and stability of parallel algorithms. Indeed this seems to be an area offering much scope for research.

Many different types of parallel computer have been investigated for statistical applications. The use of parallel machines appears to be demand driven and little unified effort has been made to utilize parallel computers more effectively. Perhaps the vectorization of the SAS package will start to change this situation.

We attribute the limited use of parallel computers to the following:

• Novelty of parallel computers. Parallel computers are not widely available, nor have they achieved widespread commercial success. Indeed, many hardware and software issues in parallel processing remain active research topics.
• Modern sequential computers provide sufficient power to drive standard packages for most statistical problems. Statistical computing can generally be accommodated adequately by conventional computers. Even if a job takes a long time, it is easy for the statistician to leave a computer running overnight or over a weekend.
• The current absence of standard packages. Most statistical computing is based on statistical packages. With the exception of SAS, no statistics packages have been implemented in optimized form on parallel computers.
• Parallel methods may not be appropriate for interactive computing, thereby making them ineffective for routine analysis. In particular, some MIMD systems are not readily adaptable to the interactive computing needs of routine statistical analysis. There are two problems associated with this issue. First, coding software for interactive analysis could be exceedingly difficult. Second, poor processor utilization is likely to result from interactive analysis.
• Lack of libraries of parallel routines (on MIMD machines) to ease the task of programming. Libraries of mathematical and statistical routines can be an indispensable tool. Both NAG and IMSL libraries are implemented on certain SIMD computers. Implementing the BLAS (Basic Linear Algebra Subroutines) fully on MIMD machines is an active research area, requiring a set of communications subroutines, BLACS (Basic Linear Algebra Communications Subroutines).
• Parallel computers can be very difficult to program and debug. Both SIMD and MIMD computers require extra programming skills. In most cases SIMD machines require less detailed knowledge of parallel computation and offer more powerful system support tools. MIMD programming is complex when much communication is required, although library routines for communication could be developed.
• The wide variety of parallel architectures. Many different designs of parallel computer exist, each with unique computational properties. The availability of parallel computers to statisticians may be dictated by cost and not by how appropriate the computer is to the statistician. Note that in general SIMD machines require less effort to obtain good performance, although such machines are generally higher in cost than MIMD machines.
• The parallel architecture must be suited to the application. For the best results a statistical problem must be expressible in a form that can be implemented in an efficient manner on the particular parallel computer being used. The task of developing algorithms into the appropriate form can be complicated and time consuming.

For a statistical application to justify a parallel solution, it must require facilities not available on conventional computers. Typically the requirement is speed, although it could also be large amounts of memory or disk space. Currently the use of parallel computers for statistical applications has a potentially long software development time. Hence, the statistician should think carefully about the return: the cost of development on a parallel computer must be balanced against the performance benefits likely to ensue.

At present (many) parallel computers are inappropriate for routine data analysis. This may start to change with the vectorization of the SAS system, although use of such software would still be constrained by the availability of the hardware. For the patient researcher who has a computer-intensive problem, however, the increased power of parallel machines may yield large returns.

The wide availability of workstation clusters and other networked systems offers the most immediate prospect of parallel processing hardware to statisticians. Specialist parallel computers will undoubtedly become more widely available in the future. Whether statisticians adopt them for use depends, we believe, upon the facilities and tools developed by hardware and in particular statistical software manufacturers.

Acknowledgement

We are grateful to the anonymous referees for their useful comments which lead to a much improved paper. Earlier drafts benefited from the constructive criticism of Professor Roger Payne of Rothamsted Experimental Station.

References

Akl, S. (1985) Parallel Sorting Algorithms. Academic Press, New York.
Al-Jumeily, D. M., Clegg, D. B., Pountney, D. C. and Harris, P. (1994) Optimising Simple Statistical Calculations Using Memory Computers. No. CMS 5, School of Computing and Mathematical Sciences, Liverpool John Moores University.
Anderson, S. L. (1990) Random number generators on vector computers and other advanced architectures. SIAM Review, 32(2), 221-51.
Bäck, T. and Hoffmeister, F. (1994) Basic aspects of evolution strategies. Statistics and Computing, 4, 51-63.
Bailey, D. H. (1991) Twelve ways to fool the masses when giving performance results on parallel computers. Supercomputer, 8(5), 4-7.
Bertsekas, D. P. and Tsitsiklis, J. N. (1989) Parallel and Distributed Computation. Prentice-Hall, Englewood Cliffs, NJ.
Brophy, J. F., Gentle, J. E., Li, J. and Smith, P. W. (1989) Software for advanced architecture computers. In K. Berk and L. Malone (eds), Computer Science and Statistics, Proceedings of the 21st Symposium on the Interface, pp. 116-20. American Statistical Association.
Carriero, N. and Gelernter, D. (1989) LINDA in context. Communications of the ACM, 32(4), 444-58.
Chambers, J. M. (1977) Computational Methods for Data Analysis. Wiley, New York.
Cray-1 Computer Systems (1981) Fortran (CFT) reference manual. Publication No. SR-0009, Rev. H.
de Doncker, E. and Kapenga, J. (1989) Parallel multivariate numerical integration. In G. Rodrigue (ed.), Parallel Processing for Scientific Computing, pp. 109-13. SIAM, Philadelphia.
de Doncker, E. and Vakalis, I. (1993) Convergence results and speedup of parallel numerical integration algorithms. In R. F. Sincovec, D. E. Keys, M. R. Leuze, L. R. Petzold and D. A. Reed (eds), Parallel Processing for Scientific Computing, Vol. 2, pp. 539-45. SIAM, Philadelphia.
de Doncker, E., Kapenga, J. A. and McKean, J. W. (1989) Robust projection pursuit. In K. Berk and L. Malone (eds), Computer Science and Statistics, Proceedings of the 21st Symposium on the Interface, pp. 308-13. American Statistical Association.
Dongarra, J. J. and Sorenson, D. C. (1987) A portable environment for developing parallel Fortran programs. Parallel Computing, 5, 139-54.
Dongarra, J. J. and Tourancheau, B. (1992) Environments and Tools for Parallel Scientific Computing. North-Holland, Amsterdam.
Dongarra, J. J., Duff, I. S., Sorenson, D. C. and van der Vorst, H. A. (1991) Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia.
Du Croz, J. (1990) Supercomputing with the NAG Library. Supercomputer, 7(2), 72-80.
Durst, M. J. (1987) Library software in the supercomputing environment. In R. M. Heiberger (ed.), Computer Science and Statistics, Proceedings of the 19th Symposium on the Interface, pp. 7-12. American Statistical Association.
Eddy, W. F. (1986) Parallel architecture: a tutorial for statisticians. In T. M. Boardman and I. M. Stefanski (eds), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 23-9. American Statistical Association.
Eddy, W. F. and Schervish, M. J. (1986) Discrete-finite inference on a network of Vaxes. In T. M. Boardman and I. M. Stefanski (eds), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 30-6. American Statistical Association.
Eddy, W. F., Meyer, M. M., Mockus, A., Schervish, M. J., Tan, K. and Viele, K. (1992) Smoothing census adjustment factors: an application of high performance computing. In H. J. Newton (ed.), Computing Science and Statistics, Proceedings of the 24th Symposium on the Interface, pp. 503-10. American Statistical Association.
Eddy, W. F. and Schervish, M. J. (1991) Parallel computing - a tutorial for statisticians. In E. M. Keramidas (ed.), Computing Science and Statistics, Proceedings of the 23rd Symposium on the Interface, pp. 479-86. Interface Foundation of North America.
Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. Chapman and Hall, London.
Encore (1988) Encore Parallel Fortran. Ref. No. 724-06785, Encore Computer Corporation, Fort Lauderdale, FL.
Fahrmeir, L. (1977) Parallel estimation algorithms for stochastic parameters of time series models. In L. Feilmeier (ed.), Parallel Computers - Parallel Mathematics, pp. 99-102. North-Holland, Amsterdam.
Flynn, M. J. (1972) Some computer organisations and their effectiveness. IEEE Transactions on Computers, 21(9), 948-60.



Freeman, T. L. and Philips, C. (1992) Parallel Numerical Algorithms. Prentice-Hall, Englewood Cliffs, NJ.
Freisleben, B. (1993) Parallel learning algorithms for principal component extraction. In Proceedings of the 3rd International Conference on Artificial Neural Networks, 372, 267–71.
Furnival, G. M. and Wilson, R. W., Jr. (1974) Regression by leaps and bounds. Technometrics, 16, 499–511.
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1993) PVM 3.0 User's Guide and Reference Manual. Tech. Rept. ORNL/TM-12187, Oak Ridge National Laboratory.
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1994) PVM: Parallel Virtual Machine – A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA (also available online at http://www.netlib.org/pvm3/book/pvm-book.html).
Gladwell, I. (1987) Vectorisation of one dimensional quadrature codes. In G. Fairweather and P. M. Keast (eds), Numerical Integration: Recent Developments, Software and Applications, NATO ASI Series C203, pp. 230–8.
Golub, G. and Ortega, J. M. (1993) Scientific Computing: An Introduction with Parallel Computing. Academic Press, New York.
Gonzalez, C., Chen, J. and Sarma, J. (1988) A tool to generate FORTRAN parallel code for the Intel iPSC/2 Hypercube. In E. J. Wegman, D. T. Gantz and J. J. Miller (eds), Computer Science and Statistics, Proceedings of the 20th Symposium on the Interface, pp. 214–9. American Statistical Association.
Grenander, U. and Miller, M. I. (1994) Representation of knowledge in complex systems. Journal of the Royal Statistical Society, Series B, 56(4), 549–603.
Havránek, T. and Stratkoš, Z. (1989) On practical experience with parallel processing of linear models. Bulletin of the International Statistical Institute, 53, 105–17.
Hawkins, D. M., Simonoff, J. S. and Stromberg, A. J. (1994) Distributing a computationally intensive estimator: the case of exact LMS regression. Computational Statistics, 9, 83–95.
Healey, A. R. and Davies, S. T. (1983) Statistical model fitting on the ICL distributed array processors. In M. Feilmeier, J. Joubert and U. Schendel (eds), Parallel Computing '83, pp. 311–17. Elsevier, Amsterdam.
Hénaff, P. J. and Norman, A. L. (1987) Solving nonlinear econometric models using vector processors. In T. M. Boardman and I. M. Stefanski (eds), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 348–51. American Statistical Association.
Hockney, R. W. and Jesshope, C. R. (1988) Parallel Computers 2. Adam Hilger, Bristol.
Huber, P. J. (1985) Projection pursuit. Annals of Statistics, 13, 435–525.
Hwang, K. (1993) Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York.
Ihnen, L. (1989) Vectorisation of the SAS(R) System. In K. Berk and L. Malone (eds), Computer Science and Statistics, Proceedings of the 21st Symposium on the Interface, pp. 121–7. American Statistical Association.
Inmos (1990) Transputer Development System (2nd edn). Prentice-Hall, Englewood Cliffs, NJ.
Jaeckel, L. A. (1972) Estimating regression coefficients by minimising the dispersion of the residuals. Annals of Mathematical Statistics, 43, 1449–58.
Kapenga, J. A. and McKean, J. W. (1987) The vectorisation of algorithms for R-estimates in linear regression. In R. M. Heiberger (ed.), Computer Science and Statistics, Proceedings of the 19th Symposium on the Interface, pp. 502–5. American Statistical Association.
Kaufman, L. and Rousseeuw, P. J. (1986) Clustering large data sets. In E. Gelsema and L. Kanal (eds), Pattern Recognition in Practice II, pp. 425–37. Elsevier/North-Holland, Amsterdam.
Kaufman, L., Hopke, P. K. and Rousseeuw, P. J. (1988) Using a parallel computer system for statistical resampling methods. Computational Statistics Quarterly, 2, 129–41.
Kaufmann, W. J. and Smarr, L. L. (1993) Supercomputing and the Transformation of Science. Scientific American Library.
Kleijnen, J. P. C. (1990) Supercomputers for Monte Carlo simulation: cross-validation versus Rao's test in multivariate analysis. In K.-H. Jöckel, G. Rothe and W. Sendler (eds), Bootstrapping and Related Techniques, pp. 233–45. Springer-Verlag, Berlin.
Kleijnen, J. P. and Annink, B. (1992) Vector computers, Monte Carlo simulation and regression analysis: an introduction. Management Science, 38(2), 170–81.
Lafaye de Micheaux, D. (1984) Parallelization of algorithms in the practice of statistical data. In T. Havránek, Z. Šidák and M. Novák (eds), COMPSTAT '84 – Proceedings in Computational Statistics, pp. 293–300. Physica-Verlag, Vienna.
Lewis, T. G. and El-Rewini, H. (1992) Introduction to Parallel Processing. Prentice-Hall, Englewood Cliffs, NJ.
Lootsma, F. A. (1989) Parallel Non Linear Optimisation. No. 89-45, Faculty of Tech. Math. and Informatics, Delft University of Tech.
Lootsma, F. A. and Ragsdell, K. M. (1988) State-of-the-art in parallel nonlinear optimisation. Parallel Computing, 6, 133–55.
Malfait, M., Roose, D. and Vandermeulen, D. (1993) A convergence measure and some parallel aspects of Markov chain Monte Carlo algorithms. In Su-Shing Chen (ed.), Neural and Stochastic Methods in Image and Signal Processing, Proc. SPIE 2032, 23–34.
McCullagh, P. and Nelder, J. A. (1983) Generalised Linear Models. Chapman and Hall, London.
McKean, J. W. and Hettmansperger, T. P. (1978) A robust analysis of the general linear model based on one step R-estimates. Biometrika, 65, 571–9.
Mitchell, T. J. and Beauchamp, J. J. (1986) Algorithms for Bayesian variable selection in regression. In T. M. Boardman (ed.), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 181–2. American Statistical Association.
Mitchell, T. J. and Morris, M. D. (1988) A Bayesian approach to the design and analysis of computational experiments. In E. J. Wegman, D. T. Gantz and J. J. Miller (eds), Computer Science and Statistics, Proceedings of the 20th Symposium on the Interface, pp. 49–51. American Statistical Association.
Modi, J. J. (1988) Parallel Algorithms for Matrix Computations. Clarendon Press, Oxford.
O'Sullivan, F. and Pawitan, Y. (1993) Multidimensional density estimation by tomography. Journal of the Royal Statistical Society, Series B, 55(2), 509–21.



Ortega, J. M., Voigt, R. G. and Romine, C. H. (1990) A bibliography on parallel and vector numerical algorithms. In K. A. Gallivan, M. T. Heath, E. Ng et al., Parallel Algorithms for Matrix Computations, pp. 125–97. SIAM, Philadelphia.
Ostrouchov, G. (1987) Parallel computing on a hypercube: an overview of the architecture and some applications. In R. M. Heiberger (ed.), Computer Science and Statistics, Proceedings of the 19th Symposium on the Interface, pp. 27–32. American Statistical Association.
Perrott, R. H. (1987) Parallel Programming. Addison-Wesley, Reading, MA.
Quinn, M. J. (1987) Designing Efficient Algorithms for Parallel Computers. McGraw-Hill, New York.
Raphalen, M. (1982) Applying parallel processing to data analysis: computing a distance's matrix on an SIMD machine. In H. Caussinus, P. Ettinger and R. Tomassone (eds), COMPSTAT '82 – Proceedings in Computational Statistics, pp. 382–6. Physica-Verlag, Vienna.
Rousseeuw, P. J. (1984) Least median of squares regression. Journal of the American Statistical Association, 79, 871–80.
Schervish, M. J. (1988) Applications of parallel computation to statistical inference. Journal of the American Statistical Association, 83(404), 976–83.
Schervish, M. J. and Tsay, R. S. (1988) Bayesian modelling and forecasting in large scale time series. In J. C. Spall (ed.), Bayesian Analysis of Time Series and Dynamic Models, pp. 23–52. Marcel Dekker, New York.
Schnabel, R. B. (1988) Sequential and parallel methods for unconstrained optimization. Tech. Rept. CU-CS-414-88, Dept. of Comput. Sci., University of Colorado at Boulder, CO.
Schork, N. J. and Hardwick, J. (1990) Supercomputer-intensive multivariable randomization tests. In C. Page and R. LePage (eds), Computing Science and Statistics, Proceedings of the 22nd Symposium on the Interface, pp. 509–13. Springer-Verlag, New York.
Skvoretz, J., Smith, S. A. and Baldwin, C. (1992) Parallel processing applications for data analysis in the social sciences. Concurrency: Practice and Experience, 4(3), 207–21.
Stewart, G. W. (1986) Communication in parallel algorithms: an example. In T. M. Boardman and I. M. Stefanski (eds), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 11–14. American Statistical Association.
Stewart, G. W. (1988) Parallel linear algebra in statistical computations. In D. Edwards and N. E. Raun (eds), COMPSTAT '88, Proceedings in Computational Statistics, pp. 3–14. Physica-Verlag, Vienna.
Stine, R. A. and Woteki, T. H. (1989) A graphical programming environment for statistical simulations with parallel processing. In ASA Proceedings of the Statistical Computing Section, pp. 104–9. American Statistical Association.
Stratkoš, Z. (1987) Effectivity and optimizing algorithms and programs on the host-computer/array processor systems. Parallel Computing, 4, 197–207.
Sylwestrowicz, J. D. (1982) Parallel processing in statistics. In H. Caussinus, P. Ettinger and R. Tomassone (eds), COMPSTAT '82 – Proceedings in Computational Statistics, pp. 131–6. Physica-Verlag, Vienna.
Thisted, R. A. (1988) Elements of Statistical Computing. Chapman and Hall, London.
Wilson, G. V. (1993) A glossary of parallel computing terminology. IEEE Parallel and Distributed Technology, February, pp. 52–67.
Wollan, P. (1988) All-subsets regression on a hypercube computer. In E. J. Wegman, D. T. Gantz and J. J. Miller (eds), Computer Science and Statistics, Proceedings of the 20th Symposium on the Interface, pp. 224–7. American Statistical Association.
Xu, C. W. and Shiue, W. K. (1991) Parallel bootstrap and inference for means. Computational Statistics Quarterly, 3, 233–9.
Xu, C. W. and Shiue, W. K. (1993) Parallel algorithms for least median of squares regression. Computational Statistics and Data Analysis, 16, 349–62.
Xu, M., Miller, J. J. and Wegman, E. J. (1989) Parallelizing multiple linear regression for speed and redundancy: an empirical study. In K. Berk and L. Malone (eds), Computer Science and Statistics, Proceedings of the 21st Symposium on the Interface, pp. 138–44. American Statistical Association.
Zenios, S. A. (1989) Parallel numerical optimization: current status and an annotated bibliography. Operations Research Society of America Journal on Computing, 1, 20–43.
