
Statistics and Computing (1996) 6, 37-49

A review of parallel processing for

statistical computation

N. M. ADAMS, S. P. J. KIRBY, P. HARRIS and D. B. CLEGG

Department of Statistics, Faculty of Mathematics and Computing, The Open University,

Walton Hall, Milton Keynes, MK7 6AA, UK

Received September 1994 and accepted July 1995

Parallel computers differ from conventional serial computers in that they can, in a variety of ways, perform more than one operation at a time. Parallel processing, the application of parallel computers, has been successfully utilized in many fields of science and technology. The purpose of this paper is to review efforts to use parallel processing for statistical computing. We present some technical background, followed by a review of the literature that relates parallel computing to statistics. The review material focuses explicitly on statistical methods and applications, rather than on conventional mathematical techniques. Thus, most of the review material is drawn from statistics publications. We conclude by discussing the nature of the review material and considering some possibilities for the future.

Keywords: Parallel processing, statistical computing, Flynn's taxonomy, parallel software tools, parallel performance

1. Introduction

We present a review of the literature pertaining to the explicit application of parallel processing to statistics. Literature of this type is scarce, despite many authors (including Sylwestrowicz, 1982, and Havránek and Stratkoš, 1989) commenting on the utility of parallel processing for statistics. The review was conducted in order to locate the available literature in this field and to isolate potential research areas. The elements of statistical computation that we consider are limited explicitly to numerical statistical algorithms and statistical applications. Aspects such as linear algebra, optimization and quadrature in the mathematical formulation are only briefly considered. Also outside the scope of the review are such areas as parallel computation for graphics and symbolic computation.

Parallel processing, that is, the use of parallel computers, is a vast field. Quinn (1987) suggested that the 1990s would be the decade of the parallel computer. Certainly, parallel computers provide computational power to solve otherwise intractable problems (Kaufmann and Smarr, 1993). In Section 2 we give a brief introduction to parallel processing, including outline descriptions of hardware and software development tools.

Many different designs of parallel computer exist. The computational properties of many of these designs are well understood. Most of the basic tools of statistical computing, including linear algebra, sorting and random number generation, have been implemented on parallel computers, though not necessarily explicitly for the purpose of statistical computing. Section 3 provides a review of the available literature relating parallel processing to statistics, together with references to the parallelization of major numerical methods. Some details of the literature search methods employed in this review are also briefly described. Finally, Section 4 contains some closing comments about the nature of the review material and the likely prospects for parallel processing in statistics.

2. Parallel processing

Parallelism is the process of performing tasks concurrently, that is, more than one task per unit time. A parallel computer is a computer that has the ability to exploit parallelism incorporated in its architecture. This architecture usually consists of a collection of processing units coupled by an interconnection network.

Eddy (1986) and Eddy and Schervish (1991) have made some effort towards introducing statisticians to parallel computing by presenting tutorial-style introductory papers.




Ostrouchov (1987) describes computers with hypercube architecture and gives some applications. Lewis and El-Rewini (1992) gives an excellent general introduction to parallel computing. For a more detailed description of hardware see either Hwang (1993) or Hockney and Jesshope (1988). Detailed texts on parallel computation include Freeman and Philips (1992) and Bertsekas and Tsitsiklis (1989). Dongarra et al. (1991) is devoted to a discussion of linear algebra on shared memory parallel computers and includes detailed performance measures for a variety of machines as well as an extensive glossary. Wilson (1993) gives a useful glossary of parallel computing terminology.

A variety of parallel computers exist with radically varying architecture. Attempts to classify the different designs have not been completely successful (Hockney and Jesshope, 1988) and no universally accepted classification scheme exists. In this paper we use Flynn's taxonomy (Flynn, 1972), which classifies computers according to how the machine relates its instructions to the data being processed. We choose Flynn's classification because it is adequate for describing parallel algorithms as well as hardware, and it is widely used (Freeman and Philips, 1992). Within Flynn's taxonomy, a stream is a sequence of items (instructions or data) executed or operated on by a processor. There are four broad categories, dealing with single and multiple streams of items:

• SISD: Single Instruction Single Data stream. A conventional sequential computer. Each instruction initiates an operation.
• SIMD: Single Instruction Multiple Data stream. A computer that has a single stream of instructions that initiate operations on many streams of data.
• MISD: Multiple Instruction Single Data stream. This category illustrates the problems inherent in interpreting Flynn's taxonomy. Hwang (1993) puts systolic arrays in this class, while other authors assert that no computers fall in this category.
• MIMD: Multiple Instruction Multiple Data stream. A computer with several processing units capable of operating on several data streams. This includes all forms of multiprocessor and is the most general form of parallelism.

This classification is reasonably well-defined, although many modern parallel systems are hybrids of SIMD and MIMD and hence belong not to either class but to both. We will describe the SIMD and MIMD models in more detail.

2.1. SIMD

Fig. 1. Diagrammatic representation of an SIMD machine. C, control unit; P, processors; S, store (memory)

The SIMD model is represented pictorially by Fig. 1, whereby processors are independent but can access all available memory. A single control unit drives the processing units. SIMD computers typically execute a single stream of instructions with a number of simple processing units, each performing the same instructions on its own data. At a given time, the same instruction is being executed on a collection of processors with each processor manipulating different data. All computers in this class therefore have synchronous operation, that is, access to shared memory is tightly coordinated. This is the most useful model for massively parallel (incorporating many processors) scientific computing, with many engineering and scientific tasks falling naturally in this class including image processing, particle simulation and finite element methods (Lewis and El-Rewini, 1992).

The SIMD model includes: array processors, pipelined vector processors and systolic arrays.

• Array processors consist of a set of elementary processing units connected by a grid, which is usually square. These machines are well suited to computations involving matrix manipulations. Examples include the AMT Distributed Array Processor and Thinking Machines' Connection Machine. An attached-array is simply a conventional computer with array hardware connected.
• Pipelined vector processors achieve parallelism by two methods. First, arithmetic operations (addition, multiplication, etc.) are broken down into individual elements and computed as a pipeline. Processing data as a pipeline can be pictured as performing lower level operations as an assembly line. Second, vector processing units, as the name suggests, allow arithmetic operations between pairs of vector elements fed from vector registers to functional units that use pipelining techniques. The Cray-1 and CDC Cyber 205 are examples of pipelined vector processors. Some more recent systems, such as the Cray-2 and Convex C3800, consist of a set of vector processors.
• Systolic arrays are a highly specialized SIMD design that typically consist of a tightly coordinated square array of processors operating on data accessible from memory around the perimeter of the array.

Fig. 2. Diagrammatic representation of an MIMD machine. C, control unit; P, processor; S, shared or local store (memory)

2.2. MIMD

Multiprocessor systems form the bulk of this class. Figure 2 is a graphical model of MIMD computers. Here, each processor drives its own control unit and can have its own memory, although memory could alternatively be shared.

A multiprocessor system consists of a number of powerful separate processing elements that are in some way connected. An individual processor may itself be a scalar or vector processor, or even a multiprocessor. Multiprocessor systems can be classified into two groups: shared memory and local memory systems, although this distinction is blurred in some modern systems.

• Shared memory systems incorporate a set of processing elements that, by a variety of possible methods of interconnection, share access to a common memory area. Examples of shared memory multiprocessor systems include the Cray-2, the Encore Multimax and the Alliant FX/80.
• Local or distributed memory systems incorporate multiple processing elements each with its own private memory. Here processors communicate directly, and global access to local memory is generally not allowed. Examples of local memory systems include the Intel iPSC/1, Ametek S14 Hypercube and Transputer systems. Whichever kind of system is being used, processor coordination must at some time occur. The two methods of coordination are synchronous and asynchronous. Shared memory systems have synchronous operation since access to globally shared memory is coordinated to prevent contention. The processors in a local memory system have asynchronous operation and coordination is achieved by message-passing. The type of coordination is an important consideration in software and algorithm design.

The term topology refers to the manner in which processors in MIMD and certain SIMD systems are linked. Some systems have a single topology enforced by their construction. Others, most notably transputer systems, allow the reconfiguration of communications links. Various topologies exist including linear chains, rings, meshes and hypercubes; see Fig. 3 for some examples. Each topology has unique computation and communication properties. For more details of parallel computer topology see Bertsekas and Tsitsiklis (1989).

Fig. 3. Some conceptual diagrams of simple parallel topologies. (A) a linear chain of p processors; (B) a complete ring of p processors; (C) a square mesh of 16 processors

Distributed computing systems are an important member of the MIMD category. These parallel systems are constructed by linking independent computers such as workstations together via a network and using the resulting distributed system for a single task. Such systems are becoming more popular, and have been used for some statistical applications, such as in the analysis of hierarchical models and discrete finite inference (Schervish, 1988; Eddy and Schervish, 1986) and exact least median of squares regression (Hawkins et al., 1994). Distributed systems are often constructed from different types of computer, in which case they are termed heterogeneous.

2.3. Software development

There is a broad range of tools for software development on parallel computing systems.



Unfortunately, there is currently no consistent standard for parallel languages or software development tools. Each class of computer requires tools that reflect the architecture. Additionally, each manufacturer's machine tends to have a bespoke software environment that seriously compromises the portability of developed code.

Broadly speaking, parallel languages can be divided into languages designed to accommodate parallelism and extensions to standard sequential languages. In the former group, the most notable examples are occam and Ada. Parallel extensions exist to certain conventional languages (Perrott, 1987) including Fortran and Pascal.

Tools to assist in program development (Freeman and Philips, 1992) include parallelizing compilers, parallel programming environments and parallel debugging aids.

• A parallelizing compiler is capable of receiving source code in a sequential language, typically Fortran, identifying areas of potential parallelism and producing parallel executable code. This tool presents the opportunity to users to have their sequential algorithms implemented on a parallel machine without the laborious task of rewriting them in parallel. However, parallelizing compilers do not guarantee good performance and some rewriting is often required for improved results. Examples of parallelizing compilers include the EPF (Encore Parallel Fortran) compiler (Encore, 1988) available on the Encore Multimax and the CFT (Cray Fortran Translator) compiler (Cray-1 Computer Systems, 1981) for the Cray-1. Gonzalez et al. (1988) reports on tools for an Intel iPSC/2 Hypercube that generate parallel Fortran from the sequential form.
• Parallel programming environments are a coordinated set of tools designed to ease the burden of coding by automating some of the steps in developing parallel programs. Existing programming environments only encompass part of the programming life cycle and the development of such environments is in its infancy (Lewis and El-Rewini, 1992). Two examples include SCHEDULE (Dongarra and Sorenson, 1987) and the Transputer Development System (Inmos, 1990). For a detailed description of parallel environments see Lewis and El-Rewini, or Dongarra and Tourancheau (1992). Environments designed to allow portable and network parallel programming include PVM (Parallel Virtual Machine) and LINDA (Carriero and Gelernter, 1989). PVM (Geist et al., 1993, 1994) is a widely used and freely available tool for constructing parallel distributed systems.
• Parallel debugging aids are essential since methods of debugging a sequential program are inadequate for parallel code. In addition to the usual debugging facilities, a parallel debugger must provide tools for determining processor synchronization points and processor utilization.

A major asset for statistical programming is the availability of subroutine libraries, such as the NAG (Numerical Algorithms Group, UK) or IMSL (International Mathematical and Statistical Libraries, Houston, Texas, USA) libraries. The NAG library is implemented on certain vector supercomputers (Du Croz, 1990). The routines are vectorized to suit the architecture of the specific machine. The implementation of NAG routines on supercomputers is based on the implementation of the BLAS (Basic Linear Algebra Subprograms); see Freeman and Philips (1992). Brophy et al. (1989) report that vectorized versions of the IMSL libraries are also based on the BLAS. Durst (1987) gives a detailed discussion of the use of library software for supercomputing.

A vectorized version of the SAS statistics package is available on Convex supercomputers at sites including the University of London Computing Centre. Ihnen (1989) describes the vectorization of the SAS software. Vectorized SAS utilizes BLAS routines in a similar manner to the NAG and IMSL libraries. We know of no other major statistics package that has been highly optimized for a supercomputer.

2.4. Assessing performance

Many benchmarks exist for comparing the actual speed of parallel computers for specific problems and applications (Hockney and Jesshope, 1988; Dongarra et al., 1991). To the best of our knowledge there are no benchmarks of an explicitly statistical nature, although these would be highly desirable for comparison of high performance computers in this area. Of equal interest here are methods of characterizing and measuring program or algorithm performance. Two terms of particular importance are grain size and speed-up, although interpreting quoted performance is often difficult, as wryly observed by Bailey (1991).

2.4.1. Grain size

Grain size is a qualitative measure of the inherent parallelism of an algorithm. Grain size refers to the number of instructions performed in parallel before some kind of processor synchronization must occur. The parallelism in a given algorithm may be fine-grain, medium-grain or coarse-grain, with fine-grain representing many synchronization points and coarse-grain representing very few. Broadly speaking, simple scalar-vector operations are fine-grain, vector-matrix operations are medium-grain and matrix-matrix operations are coarse-grain. Freeman and Philips (1992) note that SIMD machines are generally suited to fine- and medium-grain parallelism and MIMD machines are more suitable for coarse-grain parallelism.

2.4.2. Speed-up

Researchers of parallel algorithms often quote the speed-up ratio for an algorithm. Speed-up is merely a measure of an algorithm's utilization of many processors. Let T_p be the execution time on p processors. The speed-up ratio S_p on p processors is given by

S_p = T_1 / T_p.

This quantity measures the speed-up obtained by parallelizing a particular algorithm. The fastest sequential and parallel algorithms are generally used for such comparisons. Ideally S_p should increase linearly with p, with gradient 1. This is seldom the case since at some stage processors must communicate. Stewart (1986) gives a detailed example of communication within a parallel algorithm. The efficiency is usually taken as S_p over p and is a measure of processor utilization.

Useful tools for parallel algorithm designers are speed-up or timing models (Freeman and Philips, 1992). Using knowledge of a parallel computer's performance, it is possible to construct a speed-up model that provides an estimate of attainable speed-up for a particular algorithm. Such models can allow the developer to assess the performance of a program without writing any code. Al-Jumeily et al. (1994) give examples of speed-up models for computing the arithmetic mean.
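As a small illustration of these two measures, the following fragment (our own sketch, written in Python rather than the Fortran of the work reviewed, with execution times invented purely for illustration) computes S_p and the corresponding efficiency S_p/p from a table of hypothetical timings.

```python
# Illustrative sketch only: speed-up S_p = T_1 / T_p and efficiency E_p = S_p / p,
# computed from hypothetical execution times (in seconds) on p processors.
timings = {1: 120.0, 2: 63.0, 4: 34.0, 8: 20.0, 16: 14.0}

t1 = timings[1]
for p in sorted(timings):
    s_p = t1 / timings[p]   # speed-up relative to the single-processor time
    e_p = s_p / p           # efficiency: values near 1 indicate good utilization
    print(f"p = {p:2d}   S_p = {s_p:5.2f}   E_p = {e_p:4.2f}")
```

With these invented timings the efficiency falls steadily as processors are added, the pattern that typically appears once communication costs begin to dominate.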

2.5. Amdahl's law

Amdahl's law addresses the problem that a parallel algorithm is likely to have inherently sequential portions. It gives an upper bound to obtainable speed-up. If r is the proportion of a program that may be parallelized and s = 1 - r is the remaining inherently sequential proportion, then for a system with p processors the speed-up S_p satisfies

S_p <= 1 / (s + r/p).

Two interpretations exist for Amdahl's law (Freeman and Philips, 1992). First, it can be seen as imposing an upper bound on the speed-up ratio, with the consequence that adding processors above a certain number will not result in improved performance. Second, Amdahl's law suggests that a larger parallel machine should be used to solve ever-larger problems.
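To make the bound concrete, consider a small worked example with numbers chosen purely for illustration (they are not taken from any of the papers reviewed). Suppose a proportion r = 0.95 of a program can be parallelized, so that s = 0.05. Then

```latex
S_{16} \le \frac{1}{0.05 + 0.95/16} \approx 9.1,
\qquad
\lim_{p \to \infty} S_p \le \frac{1}{s} = 20,
```

so sixteen processors can give at most a nine-fold speed-up and, however many processors are added, such a program can never run more than twenty times faster than its sequential version.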
2.6. Exploiting parallelism in statistical algorithms

To achieve efficient performance on a parallel computer, an algorithm must be designed to suit the particular architecture of the computer. It is in the area of algorithm design that the distinction between SIMD and MIMD is most acute. In general, designing efficient algorithms for a SIMD computer is not as difficult as designing efficient algorithms for a MIMD computer. Many SIMD machines have very effective parallelizing compilers, although to achieve optimal performance some extra programming is usually required, often in the form of ad hoc compiler directives. For MIMD machines the algorithm designer has to decide how to decompose the problem subject to a number of potentially contending issues. These include keeping all processors as busy as possible, preventing communication and memory bottlenecks, and making the algorithm as fast as possible.

A number of models for parallel algorithm design have evolved, based on how the problem is decomposed. Of interest here is the SPMD (Single Program Multiple Data stream) model that describes a situation where a single program drives a number of independent processors. The SPMD model exploits geometric parallelism, whereby the computational requirements of an algorithm are split into similar independent tasks and each task is assigned to a processor. The final answer is formed by collecting the intermediate results from each processor. Different strategies exist for exploiting geometric parallelism, of which the most relevant to the review material is the master-slave mode. This involves having one processor (the master) responsible for driving the remaining processors (the slaves), distributing the data and gathering the results. The key element of this approach is that the computational effort is split among all processors in the same manner. Many algorithms require some form of load balancing. This is an attempt to distribute evenly the amount of work among available processors.
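A minimal sketch of this master-slave, data-partitioning style of decomposition may help to fix ideas. The example below is our own illustration, written in Python with the standard multiprocessing module rather than the Fortran and dedicated hardware of the work reviewed; it computes an arithmetic mean, in the spirit of the speed-up models of Al-Jumeily et al. (1994). The master splits the data into roughly equal chunks, each slave returns a partial result, and the master gathers and combines the intermediate results.

```python
# Master-slave (SPMD) sketch, illustrative only: partition the data, let each
# worker compute a partial sum and count, then gather and combine on the master.
from multiprocessing import Pool
import random

def partial_sum(chunk):
    # Work performed independently on each processor: a simple summation.
    return sum(chunk), len(chunk)

def parallel_mean(data, n_workers=4):
    # Master: split the data into roughly equal, independent chunks.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:          # the slaves
        partials = pool.map(partial_sum, chunks)
    # Master: gather the intermediate results and form the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

if __name__ == "__main__":
    data = [random.gauss(0.0, 1.0) for _ in range(100000)]
    print(parallel_mean(data))
```

The same pattern, with a different computation applied to each chunk, underlies many of the MIMD applications reviewed in Section 3.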

Lafaye de Micheaux (1984) considers construction of parallel MIMD algorithms for statistical data analysis, observing that many statistical techniques can be effectively parallelized on MIMD machines using the same SPMD model. Methods of partitioning data for simple statistical computations are often appropriate for highly complex situations such as projection and clustering. Lafaye de Micheaux concludes that one way to make high-speed processing available for statistical analysis is based on multiprocessor systems constructed from cheap components, using an SPMD approach with data partitioning.

3. Parallel processing for statistics: a review

Exploiting parallelism for statistics is not a new idea, nor is it entirely ignored in major statistical texts. As early as 1977, Chambers (1977) mentioned the use of parallel computing in data analysis situations where the magnitude of computation makes interactive analysis impossible. Thisted (1988) briefly mentioned parallel computers as being an ideal method of implementing Jacobi methods for extracting eigenvalues. Additionally, in his foreword, Thisted commented that his second volume would contain statistical algorithms for parallel computers.

An excellent source of statistics-related parallel processing literature is the Computing Science and Statistics proceedings from the annual Symposium on the Interface between computing science and statistics. Organized annually by the Interface Foundation of North America and others, these proceedings regularly carry a variety of papers on parallel processing and statistics. Material published here relates to both general parallel processing and the application of parallel processing to statistics.

This review is principally based on an initial search of two on-line databases, MATHSCI and BIDS (Bath Information and Data Service, an on-line version of the ISI citation index). Both of these databases provide access to journals and proceedings. We supplemented computer searches with conventional manual search techniques.

It is rather difficult to describe the field of parallel statistical computation systematically. This difficulty stems from the fact that many statistical algorithms are constructed from 'building-block' numerical algorithms. There is an immense amount of literature relating to numerical methods on parallel machines but we wish to focus on areas that are either explicitly statistical or numerical methods investigated in the name of statistics. The review therefore proceeds in the following manner. In each section we consider statistical computing loosely classified in the general manner of the chapter headings of Thisted (1988). Subsections are organized according to our own classification based on the availability of material. The review material is divided into two distinct classes: parallelization of conventional numerical methods, and parallelization of methods specifically in the name of statistics. In the former group, we present only outline details because of the amount of such material. The latter group, explicitly statistical in nature, is given more attention. For all of the review material we focus attention on the reported benefits and drawbacks of a parallel processing approach.

A problem with methods developed for particular parallel computers is a lack of portability. Thus, published parallel algorithms are unlikely to be readily ported to machines with different architectures. A current approach is to publish algorithmic 'skeletons'. The following subsections review parallel work relating to:

• basic numerical methods (3.1)
• numerical linear algebra (3.2)
• non-linear statistical methods (3.3)
• numerical integration and density estimation (3.4)
• seminumerical methods (3.5)
• other (3.6).

3.1. Basic numerical methods

Parallel computers provide an ideal platform for applying simple techniques to large data sets. Any statistical technique that allows data to be broken down and processed as independent subsets is amenable to parallelization (Schervish, 1988).

Schervish (1988) considered a variety of parallel applications, including discrete-finite inference, a computer-intensive approach for the analysis of discrete data, where the dominant aspect of the computation is simple summation of large sets of data. These applications allow elements of the computation to be divided and processed independently. The parallel computer investigated, ES86, was a distributed system constructed from a network of VAX computers. Parallel discrete-finite inference programs achieved near-linear speed-up on the ES86 system. Improved accuracy was also reported for the parallel algorithm.

Skvoretz et al. (1992) employed an NCUBE/10 hypercube multiprocessor (MIMD) to assess the application of parallel processing to typical large-scale social science research, using census data for subsample selection and cross-tabulation. They found that a crucial consideration for good parallel performance is keeping the processors busy while disk access occurs, and achieved this by distributing data to the processors in small packets, rather than en masse (the term packet describes the amount of data being transmitted at a given interaction). Good performance was reported. In comparison with performance of the software package SPSS-X on an IBM 3081, the parallel method was always faster for large (> 80 000 cases) data sets.

3.2. Numerical linear algebra

Much material has been published on parallel numerical linear algebra; for example, Ortega et al. (1990) give a bibliography relating to this subject containing 2000 entries! For excellent descriptions of parallel linear algebra see either Freeman and Philips (1992), Golub and Ortega (1993), Dongarra et al. (1991) or Modi (1988). Many numerical linear algebra methods have been considered for a wide variety of parallel machines. These include both iterative and direct methods and approaches for sparse matrices.

Stewart (1988) gave an overview of how statistical linear algebra computations may be mapped onto different parallel architectures, stressing the importance of the architecture and the central role of linear algebra in statistical computing. Of SIMD machines Stewart concluded that although many elementary processors are applied to a problem, their utility depends on algorithms requiring repetitive bursts of computation. Stewart suggested various solutions to problems affecting the performance of distributed memory MIMD machines. These include careful implementation of parallel algorithms to avoid inefficient coding, trying to reduce processor start-up time and running larger jobs to try to maximize the ratio of computation to communication.

3.2.1. Linear regression

Havránek and Stratkoš (1989) extended the work of Stewart into a practical situation, focusing on multiple linear regression, using a host computer/attached-array processor (SIMD).

It is noted that iterative algorithms often give better parallel performance than elimination methods (Stratkoš, 1987). Havránek and Stratkoš consider various parallel approaches to Cholesky factorization. Interestingly, models including the biggest numbers of parameters, applied to the largest test data set, yielded intractable computations, due to restricted memory. Quoted speed-up results were rather promising. The authors showed that their algorithm's performance does not depend critically on the size of data set.

Xu et al. (1989) considered multiple linear regression models on an Intel iPSC/2 using an SPMD type decomposition, for both improved performance and resistance to processor failure. They also considered the importance of packet size during parallel communication. Xu et al. report good speed-up, this being of the order of 14 for 16 processors. Amdahl's law manifested itself here, as 14 is shown to be an asymptotic limit on speed-up.

Kleijnen (1990) employed a Cyber 205 supercomputer (SIMD) for a Monte Carlo experiment to compare the performance of Rao's test for validity of a regression model with a cross-validation approach in multivariate regression. The parallel code performed well for ordinary least squares estimates (OLS), and for general least squares (GLS) estimates based on large samples. However, GLS estimates obtained from small samples actually resulted in speed-down, that is, slower absolute time using parallelism than a single processor.

Kleijnen and Annink (1992) gave a detailed description of the vectorization of a program for Monte Carlo simulation of regression analysis. They found that OLS estimates yielded good parallel performance due to effective vectorization. However, the matrix inversion in their GLS algorithm did not benefit from vectorization and speed-up was diminished relative to the OLS method.

As part of an investigation into parallel processing for social science research (see Section 3.1) Skvoretz et al. (1992) experimented with large regression models. The parallel component of the computation involved computing the covariance matrix in an SPMD manner. Experiments involved computing regression models using different numbers of processors and reading varying amounts of data from disk. They found that the latter consideration was critical in trying to obtain good performance.

3.2.2. Robust regression

Kapenga and McKean (1987) considered vectorized algorithms for the robust R-estimates of Jaeckel (1972). In particular, their purpose was to assess the vectorization of the k-step algorithm of McKean and Hettmansperger (1978). Experiments performed on an FPS-264 array processor (SIMD) are reported as achieving considerable speed-up at the highest levels of parallel optimization. The vectorized algorithm had three major contributors: a QR decomposition, a projection and a sorting stage. The vectorized sorting component gave no speed-up. Kapenga and McKean suggest that using vectorized methods makes robust analysis of this type feasible for most problems of moderate size.

Kaufman et al. (1988) considered the application of parallelism to resampling methods. Illustrative examples based on parallel algorithms for cluster analysis (see Section 3.3.1) and robust regression demonstrated the applicability of parallelism to resampling methods. The computer investigated was the 1CAP 1 IBM research machine (MIMD), consisting of a host processor and 10 connected array processors. Attention focused on the least median of squares in robust regression, with consideration of different parallelization strategies for a sequential program. The chosen method involved an SPMD type approach. Kaufman et al. noted that inter-processor communication is often a critical factor and some form of load balancing is often necessary.

Xu and Shiue (1993) developed three SPMD-type parallel algorithms for least median of squares (LMS) regression (Rousseeuw, 1984), using an Intel iPSC/2 MIMD computer. Parallelization strategies were deduced from consideration of the nested loop structure of sequential implementations. Xu and Shiue made extensive use of speed-up models to assess different formulations. Load-balancing strategies were used to improve algorithms where appropriate. An exact algorithm and an approximate algorithm yielded near linear speed-up, while a parallelized version of a fast sequential algorithm resulted in catastrophic speed-down due to severe load imbalance.

Hawkins et al. (1994) investigated various configurations for the distributed computation of exact LMS regression. Distributed computation (Section 2.2) in this case means partitioning a sequential program onto a group of processors, which may be independent computers, and gathering results on completion. Here, the processors need not be dedicated to a single task. Within Flynn's taxonomy this is a distributed memory MIMD computer. A list of tasks is presented to partition the problem into a distributed solution. The code gave good speed-up (efficiency is quoted), particularly for large sample sizes, on a variety of systems, including a 22-processor heterogeneous system that required extra coding because of the added complexity of different vendors' hardware.

3.2.3. Subset selection

Mitchell and Beauchamp (1986) developed a Bayesian method of subset selection in linear regression. A branch and bound algorithm was implemented on a Cray X-MP supercomputer (SIMD). Computing large regression models with many predictors motivated use of the supercomputer. With 25 predictor variables, reported as the largest test case, the algorithm required one second of cpu time on the Cray. Sequential timings are not reported. Like the work of Mitchell and Morris (1988) (Section 3.5), this application of parallel processing appears to be purely functional, rather than an investigation of parallelism.

Wollan (1988) addressed all-subsets regression on an Intel iPSC hypercube (MIMD), with the purpose of assessing the usefulness of parallel processing to statistics. The most notable feature of the parallel algorithm was the computation of every regression model. The algorithm performed well, achieving near-linear speed-up. However, in comparison with the Furnival-Wilson branch and bound algorithm (Furnival and Wilson, 1974) as implemented in SAS on a sequential machine, the sequential method gave faster results, although collinearity checking was less rigorous.

3.2.4. Smoothing census adjustment factors

Eddy et al. (1992) described an extensive linear algebra calculation performed as part of the US Census Post Enumeration Survey, implemented on a variety of parallel platforms in order to assess the costs and benefits of fast computing environments. The analysis involved using a variance components model to adjust for errors in the census data. Three distinct systems were investigated: a Cray Y-MP (used in SIMD mode), a Connection Machine CM-2 (SIMD) and distributed systems (MIMD) constructed from DEC UNIX workstations running under LINDA (see Section 2.3). Eddy et al. described difficulties porting the original SAS/IML (interactive matrix language) code to parallel versions of Fortran and C on the different computing platforms. Porting to the Cray was a relatively easy task. The Connection Machine port proved more difficult, with programs requiring substantial modification. Various parallel strategies were investigated for the distributed algorithm, constructed under LINDA, none of which resulted in effective speed-up. The Cray yielded the fastest results for the computation.

3.3. Non-linear statistical methods

Many non-linear optimization techniques have been implemented in parallel. Lootsma and Ragsdell (1988), Lootsma (1989) and Schnabel (1988) give detailed surveys of parallel non-linear optimization algorithms. Zenios (1989) provides a detailed bibliography on parallel optimization.

3.3.1. Clustering

Raphalen (1982) considered the use of a SIMD machine for a hierarchical clustering algorithm. Parallelization was confined only to the first stage, computing a distance matrix, based on Euclidean distance. Two algorithms were considered for this task, their application being based on the size of the data set and number of variables. The implementation machine was not described. It was noted that whilst near linear speed-up is theoretically possible, communication overheads induced by the interconnection network can reduce the achieved efficiency significantly.

Kaufman et al. (1988) investigated partitioning methods in cluster analysis on the 1CAP 1 research machine (see Section 3.2.2). A modification of a program called CLARA for the k-medoid method for clustering (Kaufman and Rousseeuw, 1986) is described. A sequential Fortran version of the program is ported to the 1CAP. Two parallelization strategies based on a master-slave approach are described, both of which have minimal communication overhead. Neither strategy heavily utilized the host processor. Good performance is reported.

3.3.2. Robust projection pursuit

de Doncker et al. (1989) reported the use of a Sun-3 network for robust projection pursuit (Huber, 1985). R-estimates (Jaeckel, 1972) were used to make the projection component of the algorithm resistant to outliers. An SPMD data partitioning algorithm was constructed. de Doncker et al. observed that for observation matrices of moderate size, most of the algorithm's processing time is concentrated on computation, with little time expended on inter-processor communication.

3.3.3. Econometric models

Healey and Davies (1983) reported the application of an ICL DAP (Distributed Array Processor) to large-scale Poisson regression models (McCullagh and Nelder, 1983). The parallelization of the serial iterative scaling algorithm was described. The results demonstrated that the parallel algorithm's performance improved relative to the amount of data used by the model.

Hénaff and Norman (1987) described the design and implementation of an algorithm for solving large non-linear econometric models on vector processors. A reduced Newton algorithm is vectorized with particular attention given to sparse matrix operations. The method was restructured so conventional sparse matrix operations were completely avoided. The algorithm was coded in Fortran with calls to the IMSL libraries. Experimental results were reported for both a CYBER-205 and a Cray X/MP. Different properties were observed in detailed analysis of the performance on the two computers. Promising results were reported.

3.3.4. Time series

The subject of time series applications is addressed to some extent in the literature describing the application of parallel processing to control systems.

In an early paper, Fahrmeir (1977) considered using parallel computers for estimating stochastic parameters of time series models. Regression, distributed-lag and ARMA models were considered. A Bayesian adaptive-filtering approach was adopted because it requires little knowledge of the dispersion properties of the error variables. The performance of the parallel algorithm is not quantified although it appears to perform well.

Schervish (1988) briefly described the parallelization of a time series model, described in detail in Schervish and Tsay (1988).

Bayesian models were constructed for autoregressive processes that allow for changes in the model. Their method for dealing with outliers required the estimation of many models. The possible models were divided among the available processors. Speed-up of the order of four for six processors is reported.

3.3.5. Hierarchical models

Schervish (1988) briefly described the application of the ES86 distributed VAX system (see Section 3.1) to the analysis of large hierarchical models for household crime data. The algorithm involved the repeated evaluation of four-dimensional integrals, giving a speed-up of 6 on a system consisting of 11 processors. Schervish notes that improving this speed-up would require extensive reprogramming.

Schervish also reported the use of the ES86 parallel system for a hierarchical model for analysing data from a large prison inmate survey. The model required that nearly 10 000 integrals be evaluated. The parallel algorithm involved splitting the computation into small pieces and distributing the pieces to each VAX processor. A network of 10 VAX stations yielded a speed-up of 8.

3.4. Numerical integration and density estimation

Numerical integration and related topics do not appear to have received the same amount of attention as the areas described above. The subject is introduced in both Golub and Ortega (1993) and Freeman and Philips (1992). Gladwell (1987) considers vectorized forms of certain one-dimensional quadrature codes. de Doncker and Kapenga (1989) describe a parallel multivariate numerical integration algorithm suitable for certain MIMD computers. Some performance results are given for MIMD adaptive quadrature codes in de Doncker and Vakalis (1993). Golub and Ortega observe that for many integrals the power of parallel processing is unnecessary. However, in certain statistical areas where high-dimensional integrals occur, such as Bayesian methods, the increased processing power of parallel computers may be very useful.

Sylwestrowicz (1982) gave examples of Monte Carlo methods in statistics, implemented on an ICL DAP (SIMD). An efficient method of pseudorandom number generation for a SIMD machine was given. A performance model for evaluation of a simple one-dimensional integral using the parallel Monte Carlo approach suggests excellent speed-up. Sylwestrowicz asserts that simulation of this kind with independent calculations (SPMD) is well suited to parallel computers, although algorithms for large problem size may be inappropriate for small problems.

O'Sullivan and Pawitan (1993) briefly described a parallel approach to multidimensional density estimation by tomography. The approach involved filtered backprojection and is readily parallelized. One-dimensional functions were integrated in different directions in parallel. On a Sequent Symmetry multiprocessor machine, linear speed-up was achieved. It is noted that this performance remains intact for higher dimensional functions.

3.5. Seminumerical methods

We use the term 'seminumerical' in a similar sense to Thisted (1988), namely to describe algorithms that consist of a large component of integer processing. This includes areas such as sorting and computing random numbers. A detailed discussion of sorting algorithms for a variety of parallel computers is given in Akl (1985). Computing random numbers in parallel is a large research area and we refer the reader to Anderson (1990) for a review of common methods.

Mitchell and Morris (1988) developed a Bayesian approach to the design and analysis of computational experiments. In this case, a computational experiment uses a computer to model a physical system, with a design to dictate the input to the program. A Bayesian approach was used for predicting the outcome of the experiment, which led to specialized design procedures. Experiments were conducted on a Cray X-MP (SIMD), with each experiment requiring about 45 seconds.

Schork and Hardwick (1990) presented results for parallelizing permutation and randomization tests on an IBM 3090-600E (MIMD) computer. Interest focused on testing the equality of k covariance matrices. Parallelism was exploited by making each processor perform an equal fraction of the desired number of permutations (again an SPMD type of approach). Near linear speed-up was reported.

Many writers, including Stewart (1988), Stine and Woteki (1989) and Kaufman et al. (1988), suggest that the bootstrap (Efron and Tibshirani, 1993) is an excellent candidate for parallel implementation. Xu and Shiue (1991) described two examples of parallelizing bootstrap confidence intervals on an Intel iPSC/2 hypercube (MIMD). The parallel algorithm divides the bootstrap sampling equally among the nodes. Each node computes parameters sequentially. Parallel sort and search methods are then applied across the nodes to generate the required percentiles for the confidence intervals. Speed-up depended on which type of confidence interval was constructed. For both algorithms the dominant computations were sorting and searching. Xu and Shiue discussed a speed-up model that gives an upper bound for expected speed-up.
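The general idea can be sketched as follows (our own simplified illustration in Python, not the authors' code, reduced to a percentile interval for a sample mean): the desired number of bootstrap replicates is divided equally among the workers, each worker resamples with its own random stream, and the pooled replicates are sorted to read off the required percentiles.

```python
# Illustrative sketch only: bootstrap replicates divided equally among workers,
# then pooled and sorted to give a percentile confidence interval for the mean.
from multiprocessing import Pool
import random

def bootstrap_block(args):
    data, n_rep, seed = args
    rng = random.Random(seed)              # independent stream per worker
    n = len(data)
    return [sum(rng.choice(data) for _ in range(n)) / n for _ in range(n_rep)]

def parallel_bootstrap_ci(data, n_rep=2000, n_workers=4, alpha=0.05):
    tasks = [(data, n_rep // n_workers, seed) for seed in range(n_workers)]
    with Pool(n_workers) as pool:
        blocks = pool.map(bootstrap_block, tasks)
    reps = sorted(r for block in blocks for r in block)   # pool and sort
    lo = reps[int((alpha / 2) * len(reps))]
    hi = reps[int((1 - alpha / 2) * len(reps)) - 1]
    return lo, hi

if __name__ == "__main__":
    sample = [random.gauss(10.0, 2.0) for _ in range(200)]
    print(parallel_bootstrap_ci(sample))
```

In the reviewed work the dominant computations were the sorting and searching steps; in a simple sketch like this one they are the part left unparallelized.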

3.6. Other

This section briefly introduces material not suitable for inclusion above. Some of these areas cover the more obscure links between computer science and statistics.

Freisleben (1993) describes parallel neural network algorithms for extracting principal components directly from data.

Bäck and Hoffmeister (1994) discuss the use of parallel processing in their introduction to genetic algorithms. Parallel aspects of Markov chain Monte Carlo methods are discussed in Malfait et al. (1993). The successful use of SIMD hardware for image processing is described by Grenander and Miller (1994).

4. Discussion and conclusion

In this review we have been specific in the choice of material we have presented. The area we have concentrated on is the application of parallel processing to statistical methods. For other areas, not explicitly statistical, we have given indicative references in the hope that interested readers will know where to turn. The review material demonstrates that parallel processing has been investigated for, and applied to, a wide range of statistical methods. While the methods and applications are diverse, a common reason for using parallel machines is processing speed. Many of the applications have been developed in parallel versions of Fortran. The complexity of the developed algorithms ranges from the simple, based on summations, to the highly complex, such as Wollan's algorithm for all subsets regression. Many of the MIMD applications have been parallelized by a similar approach, namely the algorithm is decomposed into relatively independent subsets and each subset assigned to a processor (the SPMD model; see Section 2.6). Performance problems were typically induced by load balancing difficulties and small data sets. Little attention seems to have been paid to the accuracy and stability of parallel algorithms. Indeed this seems to be an area offering much scope for research.

Many different types of parallel computer have been investigated for statistical applications. The use of parallel machines appears to be demand driven and little unified effort has been made to utilize parallel computers more effectively. Perhaps the vectorization of the SAS package will start to change this situation.

We attribute the limited use of parallel computers to the following:

• Novelty of parallel computers. Parallel computers are not widely available, nor have they achieved widespread commercial success. Indeed, many hardware and software issues in parallel processing remain active research topics.
• Modern sequential computers provide sufficient power to drive standard packages for most statistical problems. Statistical computing can generally be accommodated adequately by conventional computers. Even if a job takes a long time, it is easy for the statistician to leave a computer running overnight or over a weekend.
• The current absence of standard packages. Most statistical computing is based on statistical packages. With the exception of SAS, no statistics packages have been implemented in optimized form on parallel computers.
• Parallel methods may not be appropriate for interactive computing, thereby making them ineffective for routine analysis. In particular, some MIMD systems are not readily adaptable to the interactive computing needs of routine statistical analysis. There are two problems associated with this issue. First, coding software for interactive analysis could be exceedingly difficult. Second, poor processor utilization is likely to result from interactive analysis.
• Lack of libraries of parallel routines (on MIMD machines) to ease the task of programming. Libraries of mathematical and statistical routines can be an indispensable tool. Both NAG and IMSL libraries are implemented on certain SIMD computers. Implementing the BLAS (Basic Linear Algebra Subroutines) fully on MIMD machines is an active research area, requiring a set of communications subroutines, BLACS (Basic Linear Algebra Communications Subroutines).
• Parallel computers can be very difficult to program and debug. Both SIMD and MIMD computers require extra programming skills. In most cases SIMD machines require less detailed knowledge of parallel computation and offer more powerful system support tools. MIMD programming is complex when much communication is required, although library routines for communication could be developed.
• The wide variety of parallel architectures. Many different designs of parallel computer exist, each with unique computational properties. The availability of parallel computers to statisticians may be dictated by cost and not by how appropriate the computer is to the statistician. Note that in general SIMD machines require less effort to obtain good performance, although such machines are generally higher in cost than MIMD machines.
• The parallel architecture must be suited to the application. For the best results a statistical problem must be expressible in a form that can be implemented in an efficient manner on the particular parallel computer being used. The task of developing algorithms into the appropriate form can be complicated and time consuming.

For a statistical application to justify a parallel solution, it must require facilities not available on conventional computers. Typically the requirement is speed, although it could also be large amounts of memory or disk space. Currently the use of parallel computers for statistical applications has a potentially long software development time. Hence, the statistician should think carefully about the return: the cost of development on a parallel computer must be balanced against the performance benefits likely to ensue.

At present (many) parallel computers are inappropriate for routine data analysis. This may start to change with the vectorization of the SAS system, although use of such software would still be constrained by the availability of the hardware. For the patient researcher who has a computer-intensive problem, however, the increased power of parallel machines may yield large returns.

The wide availability of workstation clusters and other networked systems offers the most immediate prospect of parallel processing hardware to statisticians. Specialist parallel computers will undoubtedly become more widely available in the future. Whether statisticians adopt them for use depends, we believe, upon the facilities and tools developed by hardware and in particular statistical software manufacturers.

Acknowledgement

We are grateful to the anonymous referees for their useful comments which lead to a much improved paper. Earlier drafts benefited from the constructive criticism of Professor Roger Payne of Rothamsted Experimental Station.

References

Akl, S. (1985) Parallel Sorting Algorithms. Academic Press, New York.
Al-Jumeily, D. M., Clegg, D. B., Pountney, D. C. and Harris, P. (1994) Optimising Simple Statistical Calculations Using Memory Computers. No. CMS 5, School of Computing and Mathematical Sciences, Liverpool John Moores University.
Anderson, S. L. (1990) Random number generators on vector computers and other advanced architectures. SIAM Review, 32(2), 221-51.
Bäck, T. and Hoffmeister, F. (1994) Basic aspects of evolution strategies. Statistics and Computing, 4, 51-63.
Bailey, D. H. (1991) Twelve ways to fool the masses when giving performance results on parallel computers. Supercomputer, 8(5), 4-7.
Bertsekas, D. P. and Tsitsiklis, J. N. (1989) Parallel and Distributed Computation. Prentice-Hall, Englewood Cliffs, NJ.
Brophy, J. F., Gentle, J. E., Li, J. and Smith, P. W. (1989) Software for advanced architecture computers. In K. Berk and L. Malone (eds), Computer Science and Statistics, Proceedings of the 21st Symposium on the Interface, pp. 116-20. American Statistical Association.
Carriero, N. and Gelernter, D. (1989) LINDA in context. Communications of the ACM, 32(4), 444-58.
Chambers, J. M. (1977) Computational Methods for Data Analysis. Wiley, New York.
Cray-1 Computer Systems (1981) Fortran (CFT) reference manual. Publication No. SR-0009, Rev. H.
de Doncker, E. and Kapenga, J. (1989) Parallel multivariate numerical integration. In G. Rodrigue (ed.), Parallel Processing for Scientific Computing, pp. 109-13. SIAM, Philadelphia.
de Doncker, E. and Vakalis, I. (1993) Convergence results and speedup of parallel numerical integration algorithms. In R. F. Sincovec, D. E. Keys, M. R. Leuze, L. R. Petzold and D. A. Reed (eds), Parallel Processing for Scientific Computing, Vol. 2, pp. 539-45. SIAM, Philadelphia.
de Doncker, E., Kapenga, J. A. and McKean, J. W. (1989) Robust projection pursuit. In K. Berk and L. Malone (eds), Computer Science and Statistics, Proceedings of the 21st Symposium on the Interface, pp. 308-13. American Statistical Association.
Dongarra, J. J. and Sorenson, D. C. (1987) A portable environment for developing parallel Fortran programs. Parallel Computing, 5, 139-54.
Dongarra, J. J. and Tourancheau, B. (1992) Environments and Tools for Parallel Scientific Computing. North-Holland, Amsterdam.
Dongarra, J. J., Duff, I. S., Sorenson, D. C. and van der Vorst, H. A. (1991) Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia.
Du Croz, J. (1990) Supercomputing with the NAG Library. Supercomputer, 7(2), 72-80.
Durst, M. J. (1987) Library software in the supercomputing environment. In R. M. Heiberger (ed.), Computer Science and Statistics, Proceedings of the 19th Symposium on the Interface, pp. 7-12. American Statistical Association.
Eddy, W. F. (1986) Parallel architecture: a tutorial for statisticians. In T. M. Boardman and I. M. Stefanski (eds), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 23-9. American Statistical Association.
Eddy, W. F. and Schervish, M. J. (1986) Discrete-finite inference on a network of Vaxes. In T. M. Boardman and I. M. Stefanski (eds), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 30-6. American Statistical Association.
Eddy, W. F., Meyer, M. M., Mockus, A., Schervish, M. J., Tan, K. and Viele, K. (1992) Smoothing census adjustment factors: an application of high performance computing. In H. J. Newton (ed.), Computing Science and Statistics, Proceedings of the 24th Symposium on the Interface, pp. 503-10. American Statistical Association.
Eddy, W. F. and Schervish, M. J. (1991) Parallel computing - a tutorial for statisticians. In E. M. Keramidas (ed.), Computing Science and Statistics, Proceedings of the 23rd Symposium on the Interface, pp. 479-86. Interface Foundation of North America.
Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. Chapman and Hall, London.
Encore (1988) Encore Parallel Fortran. Ref. No. 724-06785, Encore Computer Corporation, Fort Lauderdale, FL.
Fahrmeir, L. (1977) Parallel estimation algorithms for stochastic parameters of time series models. In L. Feilmeier (ed.), Parallel Computers - Parallel Mathematics, pp. 99-102. North-Holland, Amsterdam.
Flynn, M. J. (1972) Some computer organisations and their effectiveness. IEEE Transactions on Computers, 21(9), 948-60.



Freeman, T. L. and Philips, C. (1992) Parallel Numerical Algorithms. Prentice-Hall, Englewood Cliffs, NJ.
Freisleben, B. (1993) Parallel learning algorithms for principal component extraction. In Proceedings of the 3rd International Conference on Artificial Neural Networks, 372, 267–71.
Furnival, G. M. and Wilson, R. W., Jr. (1974) Regression by leaps and bounds. Technometrics, 16, 499–511.
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1993) PVM 3.0 User's Guide and Reference Manual. Tech. Rept. ORNL/TM-12187, Oak Ridge National Laboratory.
Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1994) PVM: Parallel Virtual Machine – A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA (also available online at http://www.netlib.org/pvm3/book/pvm-book.html).
Gladwell, I. (1987) Vectorisation of one dimensional quadrature codes. In G. Fairweather and P. M. Keast (eds), Numerical Integration: Recent Developments, Software and Applications, NATO ASI Series C203, pp. 230–8.
Golub, G. and Ortega, J. M. (1993) Scientific Computing: An Introduction with Parallel Computing. Academic Press, New York.
Gonzalez, C., Chen, J. and Sarma, J. (1988) A tool to generate FORTRAN parallel code for the Intel iPSC/2 Hypercube. In E. J. Wegman, D. T. Gantz and J. J. Miller (eds), Computer Science and Statistics, Proceedings of the 20th Symposium on the Interface, pp. 214–9. American Statistical Association.
Grenander, U. and Miller, M. I. (1994) Representation of knowledge in complex systems. Journal of the Royal Statistical Society, Series B, 56(4), 549–603.
Havránek, T. and Stratkoš, Z. (1989) On practical experience with parallel processing of linear models. Bulletin of the International Statistical Institute, 53, 105–17.
Hawkins, D. M., Simonoff, J. S. and Stromberg, A. J. (1994) Distributing a computationally intensive estimator: the case of exact LMS regression. Computational Statistics, 9, 83–95.
Healey, A. R. and Davies, S. T. (1983) Statistical model fitting on the ICL distributed array processors. In M. Feilmeier, J. Joubert and U. Schendel (eds), Parallel Computing '83, pp. 311–17. Elsevier, Amsterdam.
Hénaff, P. J. and Norman, A. L. (1987) Solving nonlinear econometric models using vector processors. In T. M. Boardman and I. M. Stefanski (eds), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 348–51. American Statistical Association.
Hockney, R. W. and Jesshope, C. R. (1988) Parallel Computers 2. Adam Hilger, Bristol.
Huber, P. J. (1985) Projection pursuit. Annals of Statistics, 13, 435–525.
Hwang, K. (1993) Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York.
Ihnen, L. (1989) Vectorisation of the SAS(R) System. In K. Berk and L. Malone (eds), Computer Science and Statistics, Proceedings of the 21st Symposium on the Interface, pp. 121–7. American Statistical Association.
Inmos (1990) Transputer Development System (2nd edn). Prentice-Hall, Englewood Cliffs, NJ.
Jaeckel, L. A. (1972) Estimating regression coefficients by minimising the dispersion of the residuals. Annals of Mathematical Statistics, 43, 1449–58.
Kapenga, J. A. and McKean, J. W. (1987) The vectorisation of algorithms for R-estimates in linear regression. In R. M. Heiberger (ed.), Computer Science and Statistics, Proceedings of the 19th Symposium on the Interface, pp. 502–5. American Statistical Association.
Kaufman, L. and Rousseeuw, P. J. (1986) Clustering large data sets. In E. Gelsema and L. Kanal (eds), Pattern Recognition in Practice II, pp. 425–37. Elsevier/North-Holland, Amsterdam.
Kaufman, L., Hopke, P. K. and Rousseeuw, P. J. (1988) Using a parallel computer system for statistical resampling methods. Computational Statistics Quarterly, 2, 129–41.
Kaufmann, W. J. and Smarr, L. L. (1993) Supercomputing and the Transformation of Science. Scientific American Library.
Kleijnen, J. P. C. (1990) Supercomputers for Monte Carlo simulation: cross-validation versus Rao's test in multivariate analysis. In K.-H. Jöckel, G. Rothe and W. Sendler (eds), Bootstrapping and Related Techniques, pp. 233–45. Springer-Verlag, Berlin.
Kleijnen, J. P. and Annink, B. (1992) Vector computers, Monte Carlo simulation and regression analysis: an introduction. Management Science, 38(2), 170–81.
Lafaye de Micheaux, D. (1984) Parallelization of algorithms in the practice of statistical data. In T. Havránek, Z. Šidák and M. Novák (eds), COMPSTAT '84 – Proceedings in Computational Statistics, pp. 293–300. Physica-Verlag, Vienna.
Lewis, T. G. and El-Rewini, H. (1992) Introduction to Parallel Processing. Prentice-Hall, Englewood Cliffs, NJ.
Lootsma, F. A. (1989) Parallel Non Linear Optimisation. No. 89-45, Faculty of Tech. Math. and Informatics, Delft University of Tech.
Lootsma, F. A. and Ragsdell, K. M. (1988) State-of-the-art in parallel nonlinear optimisation. Parallel Computing, 6, 133–55.
Malfait, M., Roose, D. and Vandermeulen, D. (1993) A convergence measure and some parallel aspects of Markov chain Monte Carlo algorithms. In Su-Shing Chen (ed.), Neural and Stochastic Methods in Image and Signal Processing, Proc. SPIE 2032, 23–34.
McCullagh, P. and Nelder, J. A. (1983) Generalised Linear Models. Chapman and Hall, London.
McKean, J. W. and Hettmansperger, T. P. (1978) A robust analysis of the general linear model based on one step R-estimates. Biometrika, 65, 571–9.
Mitchell, T. J. and Beauchamp, J. J. (1986) Algorithms for Bayesian variable selection in regression. In T. M. Boardman (ed.), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 181–2. American Statistical Association.
Mitchell, T. J. and Morris, M. D. (1988) A Bayesian approach to the design and analysis of computational experiments. In E. J. Wegman, D. T. Gantz and J. J. Miller (eds), Computer Science and Statistics, Proceedings of the 20th Symposium on the Interface, pp. 49–51. American Statistical Association.
Modi, J. J. (1988) Parallel Algorithms for Matrix Computations. Clarendon Press, Oxford.
O'Sullivan, F. and Pawitan, Y. (1993) Multidimensional density estimation by tomography. Journal of the Royal Statistical Society, Series B, 55(2), 509–21.



Ortega, J. M., Voigt, R. G. and Romine, C. H. (1990) A bibliography on parallel and vector numerical algorithms. In K. A. Gallivan, M. T. Heath, E. Ng et al., Parallel Algorithms for Matrix Computations, pp. 125–97. SIAM, Philadelphia.
Ostrouchov, G. (1987) Parallel computing on a hypercube: an overview of the architecture and some applications. In R. M. Heiberger (ed.), Computer Science and Statistics, Proceedings of the 19th Symposium on the Interface, pp. 27–32. American Statistical Association.
Perrott, R. H. (1987) Parallel Programming. Addison-Wesley, Reading, MA.
Quinn, M. J. (1987) Designing Efficient Algorithms for Parallel Computers. McGraw-Hill, New York.
Raphalen, M. (1982) Applying parallel processing to data analysis: computing a distance's matrix on an SIMD machine. In H. Caussinus, P. Ettinger and R. Tomassone (eds), COMPSTAT '82 – Proceedings in Computational Statistics, pp. 382–6. Physica-Verlag, Vienna.
Rousseeuw, P. J. (1984) Least median of squares regression. Journal of the American Statistical Association, 79, 871–80.
Schervish, M. J. (1988) Applications of parallel computation to statistical inference. Journal of the American Statistical Association, 83(404), 976–83.
Schervish, M. J. and Tsay, R. S. (1988) Bayesian modelling and forecasting in large scale time series. In J. C. Spall (ed.), Bayesian Analysis of Time Series and Dynamic Models, pp. 23–52. Marcel Dekker, New York.
Schnabel, R. B. (1988) Sequential and parallel methods for unconstrained optimization. Tech. Rept. CU-CS-414-88, Dept. of Comput. Sci., University of Colorado at Boulder, CO.
Schork, N. J. and Hardwick, J. (1990) Supercomputer-intensive multivariable randomization tests. In C. Page and R. LePage (eds), Computing Science and Statistics, Proceedings of the 22nd Symposium on the Interface, pp. 509–13. Springer-Verlag, New York.
Skvoretz, J., Smith, S. A. and Baldwin, C. (1992) Parallel processing applications for data analysis in the social sciences. Concurrency: Practice and Experience, 4(3), 207–21.
Stewart, G. W. (1986) Communication in parallel algorithms: an example. In T. M. Boardman and I. M. Stefanski (eds), Computer Science and Statistics, Proceedings of the 18th Symposium on the Interface, pp. 11–14. American Statistical Association.
Stewart, G. W. (1988) Parallel linear algebra in statistical computations. In D. Edwards and N. E. Raun (eds), COMPSTAT '88, Proceedings in Computational Statistics, pp. 3–14. Physica-Verlag, Vienna.
Stine, R. A. and Woteki, T. H. (1989) A graphical programming environment for statistical simulations with parallel processing. In ASA Proceedings of the Statistical Computing Section, pp. 104–9. American Statistical Association.
Stratkoš, Z. (1987) Effectivity and optimizing algorithms and programs on the host-computer/array processor systems. Parallel Computing, 4, 197–207.
Sylwestrowicz, J. D. (1982) Parallel processing in statistics. In H. Caussinus, P. Ettinger and R. Tomassone (eds), COMPSTAT '82 – Proceedings in Computational Statistics, pp. 131–6. Physica-Verlag, Vienna.
Thisted, R. A. (1988) Elements of Statistical Computing. Chapman and Hall, London.
Wilson, G. V. (1993) A glossary of parallel computing terminology. IEEE Parallel and Distributed Technology, February, pp. 52–67.
Wollan, P. (1988) All-subsets regression on a hypercube computer. In E. J. Wegman, D. T. Gantz and J. J. Miller (eds), Computer Science and Statistics, Proceedings of the 20th Symposium on the Interface, pp. 224–7. American Statistical Association.
Xu, C. W. and Shiue, W. K. (1991) Parallel bootstrap and inference for means. Computational Statistics Quarterly, 3, 233–9.
Xu, C. W. and Shiue, W. K. (1993) Parallel algorithms for least median of squares regression. Computational Statistics and Data Analysis, 16, 349–62.
Xu, M., Miller, J. J. and Wegman, E. J. (1989) Parallelizing multiple linear regression for speed and redundancy: an empirical study. In K. Berk and L. Malone (eds), Computer Science and Statistics, Proceedings of the 21st Symposium on the Interface, pp. 138–44. American Statistical Association.
Zenios, S. A. (1989) Parallel numerical optimization: current status and an annotated bibliography. Operations Research Society of America Journal on Computing, 1, 20–43.
