Intel Adaptive Spike-Based Solver 1.0 User Guide

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL®
PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS
PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS,
INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS
INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR
PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL
PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLI-
CATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD
CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY
OCCUR.
Intel may make changes to specifications and product descriptions at any
time, without notice. Designers must not rely on the absence or characteris-
tics of any features or instructions marked reserved or undefined. Intel
reserves these for future definition and shall have no responsibility whatso-
ever for conflicts or incompatibilities arising from future changes to them.
The information here is subject to change without notice. Do not finalize a
design with this information.
The products described in this document may contain design defects or er-
rors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the
latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this
document, or other Intel literature, may be obtained by calling
1-800-548-4725, or by visiting Intel's Web Site.
Intel processor numbers are not a measure of performance. Processor
numbers differentiate features within each processor family, not across
different processor families. See
http://www.intel.com/products/processor_number for details.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino logo, Core
Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386,
Intel486, Intel740, IntelDX2, IntelDX4, IntelSX2, Intel Core, Intel
Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo,
Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver,
Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale,
IPLink, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm,
Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, VTune,
Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and
other countries.
* Other names and brands may be claimed as the property of others.
© 2008, Intel Corporation. All rights reserved.
Contents
1 Overview
1.1 A Quick What, Why, and How
1.2 A Hello World Example
1.3 Future Developments
1.4 User Guide Outline
2 The Subroutine SPIKE
2.1 Setting the environment
2.2 Autoadapt
2.3 Data
2.4 Disabling Spike Adapt
2.5 Running the spike_adapt.exe command
3 Separate calls
4 Banded Preconditioner
5 Manual Data Partition
5.1 Dense Banded Format
5.2 Sparse CSR Format
6 Intel® Adaptive Spike-Based Solver Examples
6.1 Example 1: Automatic Partitioning
6.2 Example 2: Automatic Partitioning and Multiple RHS
6.3 Example 3: Automatic Partitioning and Multiple RHS with Separate
    Factorization and Solution
6.4 Example 4: Manual Partitioning
6.5 Example 5: Automatic Partitioning Using the CSR Input Format
6.6 Example 6: Automatic Partitioning Using the CSR Input Format with a
    Preconditioner
6.7 Toeplitz Matrix Example
6.8 Sparse Banded Matrix Example
6.9 Calling Intel® Adaptive Spike-Based Solver from C Programs
7 Reference guide
7.1 Intel® Adaptive Spike-Based Solver 1.0 directory structure
7.2 Intel® Adaptive Spike-Based Solver and ScaLAPACK
7.3 Spike_Default
7.4 Spike
7.5 Spike_Begin
7.6 Spike_Preprocess
7.7 Spike_Process
7.8 Spike_End
7.9 spike_param details
7.10 matrix_data details
7.11 info details
Bibliography
A Mathematical Description of Key Strategies
A.1 Az = r via TU
A.2 Az = r via FL
A.3 Az = r via RL/RP
A.4 Az = r via TA
A.5 Az = r via EA
B How Spike Adapt Works
B.1 Why is Spike Adapt Necessary?
B.2 How Does Spike Adapt Work?
B.3 Spike Adapt Return Codes
C MPI Compatibility Library
Chapter 1
Overview
1.1 A Quick What, Why, and How

Intel® Adaptive Spike-Based Solver is a software package for solving
large, banded linear systems on parallel computers. Solving banded linear
systems is a crucial step in many high-performance computing (HPC)
applications. For example, such systems frequently arise after a general
sparse matrix is reordered. In other instances, banded systems serve as
effective preconditioners for general sparse systems that are solved via
iterative methods. Existing parallel software that uses direct methods
for banded matrices is mostly based on LU factorization. In contrast, our
software package is based on a different decomposition method that
increases arithmetic cost but naturally leads to lower communication
overhead, which is advantageous on modern parallel architectures where
arithmetic performance has outpaced memory and network performance. Thus,
Intel® Adaptive Spike-Based Solver offers HPC users a new and valuable
tool.
The central idea behind the package is a decomposition of a matrix
[10, 4, 6, 2, 5, 8, 9] that differs from the common LU decomposition. LU
represents a matrix A as a product of lower and upper triangular
matrices, A = LU, so that solving AX = F reduces to solving two
triangular systems, LG = F and UX = G. In contrast, our software is based
on a decomposition motivated by the important case where A is a banded
matrix. Figure 1.1 shows a banded matrix and its partitioning for
parallel processing ($B_j$ and $C_j$ are the coupling blocks between
adjacent partitions).

$$
A = \begin{pmatrix} A_1 & B_1 & \\ C_2 & A_2 & B_2 \\ & C_3 & A_3 \end{pmatrix}
$$

Figure 1.1: A banded matrix with a conceptual partition

The decomposition takes the form A = DS. Here, D is the block diagonal
matrix consisting of all the $A_j$ blocks (see Figure 1.1) and S is
$D^{-1}A$, assuming for the moment that the $A_j$ blocks are
non-singular. Matrix S has the structure of an identity matrix with some
extra "spikes", hence the name of the package (Figure 1.2).

$$
A = DS =
\begin{pmatrix} A_1 & & \\ & A_2 & \\ & & A_3 \end{pmatrix}
\begin{pmatrix} I & V_1 & \\ W_2 & I & V_2 \\ & W_3 & I \end{pmatrix}
$$

Figure 1.2: Decomposition where A = DS, $S = D^{-1}A$

In practice, D and S may not be obtained exactly, either intentionally or
due to limitations such as singularity. Instead, the numerical algorithm
yields $\tilde{D}$ and $\tilde{S}$ that resemble the structures of D and
S in Figure 1.2 and satisfy an equation of the form

$$
A = \tilde{D}\tilde{S} + R
$$

for some residual R. Even when R is non-zero, it is by design small in
some sense. The basic method employed in the software package is as
follows:

    solve ($\tilde{D}\tilde{S}$ + R) X = F via a preconditioned iterative method
          (with preconditioner M = $\tilde{D}\tilde{S}$):
        solve systems of the form MZ = Y for varying Y
    end

The key step of this iterative method is the solution of systems with the
$\tilde{D}\tilde{S}$ matrix. Solving AX = F can now be seen as involving
three conceptual steps:
1. Solving the block-diagonal system DG = F. Because D consists of
decoupled diagonal blocks $A_i$, the corresponding systems can be
solved in parallel without synchronization between them. A number of
strategies based on the LU decomposition of each $A_i$ can be applied
here. These include LU without pivoting, LU with pivoting, and a
combination of LU and UL decompositions with or without pivoting.
2. Solving the system SY = G. This system has the useful characteristic
that it is also largely decoupled. Except for a reduced system near
the junctions between the identity blocks, the remaining unknowns are
independent. The natural way to tackle this system is to first solve
the reduced system using parallel algorithms that require
interprocessor communication, and then to retrieve the rest of the
solution without further interprocess communication. Here again, a
number of different strategies exist for solving the reduced system.
3. Depending on how D and S were obtained earlier, which is related to
the exact strategy used in the two previous steps, R can be zero or
non-zero. If R is zero, then the Y obtained is the desired solution
to AX = F. Otherwise, some corrections must be computed. This can be
accomplished by a number of standard iterative methods such as
iterative refinement, GMRES, or BiCGStab, to name a few.
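To make steps 1 and 2 concrete, the following sketch writes out the exact case (R = 0) for two partitions. It is an illustration under stated assumptions, not a description of the package internals: kl and ku denote the numbers of sub- and superdiagonals of A, the hats mark the nonzero columns of a spike, and the superscripts (t) and (b), for the top ku and bottom kl rows of a block, are notation introduced here only.

```latex
% Two-partition decomposition A = DS
\[
A = \begin{pmatrix} A_1 & B_1 \\ C_2 & A_2 \end{pmatrix}
  = \begin{pmatrix} A_1 & \\ & A_2 \end{pmatrix}
    \begin{pmatrix} I & V_1 \\ W_2 & I \end{pmatrix},
\qquad V_1 = A_1^{-1} B_1, \quad W_2 = A_2^{-1} C_2 .
\]
% Step 1 (no communication): A_1 G_1 = F_1 and A_2 G_2 = F_2.
% Step 2: SY = G reads
\[
Y_1 + V_1 Y_2 = G_1, \qquad W_2 Y_1 + Y_2 = G_2 .
\]
% B_1 is nonzero only in its first ku columns and C_2 only in its last
% kl columns, so the same holds for V_1 and W_2. Writing \hat{V}_1 and
% \hat{W}_2 for those nonzero columns gives the small reduced system
\[
\begin{pmatrix} I_{kl} & \hat{V}_1^{(b)} \\ \hat{W}_2^{(t)} & I_{ku} \end{pmatrix}
\begin{pmatrix} Y_1^{(b)} \\ Y_2^{(t)} \end{pmatrix}
=
\begin{pmatrix} G_1^{(b)} \\ G_2^{(t)} \end{pmatrix},
\]
% after which the rest of the solution follows without communication:
\[
Y_1 = G_1 - \hat{V}_1 Y_2^{(t)}, \qquad Y_2 = G_2 - \hat{W}_2 Y_1^{(b)} .
\]
```

Only the kl + ku unknowns at the partition interface are coupled, which is why the reduced system is the sole part of step 2 that needs interprocessor communication.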
All in all, a large variety of strategies can be applied based on the
basic decomposition A = DS and the realization of the approximations
$\tilde{D}$ and $\tilde{S}$; i.e., $A = \tilde{D}\tilde{S} + R$, in which
R is a correction and $M = \tilde{D}\tilde{S}$ is an effective
preconditioner for a variety of iterative schemes. The package offers a
number of choices to solve AX = F based on the framework of this
decomposition.
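One way to see why a small R suffices is to look at the simplest of the iterative schemes mentioned above, iterative refinement with M as preconditioner. The derivation below is a standard identity and makes no claim about which scheme the package uses internally; the iterate index k and the spectral radius $\rho$ are notation introduced here.

```latex
\begin{align*}
X_{k+1} &= X_k + M^{-1}\,(F - A X_k), \qquad M = \tilde{D}\tilde{S}, \\
X - X_{k+1} &= (I - M^{-1}A)\,(X - X_k) = -M^{-1}R\,(X - X_k),
\end{align*}
% using A = M + R in the last step. The error therefore contracts
% whenever \rho(M^{-1}R) < 1, and a single step is exact when R = 0.
```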
One can use the software to compute the solution of AX = F by a single
call where the specific strategy can be selected automatically or
manually. A user can also solve a system by issuing several step-by-step
calls, similar to separating the LU factorization and the
forward/backward substitutions in LAPACK [1]. In this case, the user can
handle more interesting situations, including the solution of different
right-hand sides (RHS) at different times, $AX_i = F_i$, while amortizing
the one-time computation costs related to the same matrix A.
To summarize, Intel® Adaptive Spike-Based Solver 1.0 aims to solve
AX = F in parallel where A is a banded matrix. It currently supports
users using MPI to express parallelism. The algorithmic framework is
based on a decomposition of the form $A = \tilde{D}\tilde{S} + R$. This
framework allows many different strategies that can exploit special
properties of the underlying processor architectures, network properties,
as well as the numerical nature of the input matrix A. Intel® Adaptive
Spike-Based Solver 1.0 consists of two main layers: a computational layer
called Spike Core and a strategy selection layer called Spike Adapt.
Spike Core consists of the necessary linear algebra software to support
different solution strategies, whereas Spike Adapt is an independent
layer that selects an efficient strategy based on the characteristics of
the input matrix A and the underlying computer system. By default, Spike
Adapt automatically picks a strategy on the user's behalf. Nevertheless,
expert users have the option to pick a strategy manually. A strategy is
defined by algorithmic choices for each of the three steps (involving D,
S, and, as needed for non-zero R, the correction) outlined previously.
A user can ask for the solution to the problem AX = F via a single
software library call. This is covered in Chapter 2. Alternatively, this
single function call can be replaced by separate calls, similar to
separating the calls to triangular factorization and the subsequent
triangular solves. This added complexity is especially worthwhile when
solutions with different RHS for the same matrix A are needed at
different times, allowing the common preprocessing cost pertaining to A
to be amortized. Invoking the package with multiple function calls is
covered in Chapter 3. Finally, concerning data distribution, the user can
provide the complete matrix A and the RHS in the MPI master process and
rely on the functionality provided by the software package to distribute
the data to the remaining MPI processes. Alternatively, the user can
manually distribute the data. Chapter 5 covers the data distribution
options in greater detail.
1.2 A Hello World Example

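This example solves a 32-by-32 tridiagonal Toeplitz system with 6 on the
diagonal, -1 on the two off-diagonals, and the constant vector 1 as the
RHS. That is, solve for X where

$$
\begin{pmatrix}
 6 & -1 &        &        &    \\
-1 &  6 & -1     &        &    \\
   & \ddots & \ddots & \ddots & \\
   &    & -1     &  6     & -1 \\
   &    &        & -1     &  6
\end{pmatrix} X =
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ 1 \end{pmatrix}
$$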
A single call to the SPIKE subroutine takes care of data distribution and
strategy selection. The user only needs to set a few global parameters
such as the number of processors, the local MPI rank, and the structure
and bandwidth of the matrix. The matrix and RHS data are stored initially
on the MPI master process (i.e., process-0). The source code of
hello_world.f90 is listed in Figure 1.3. To create the executable,
compile the source program and link it with the Intel® Adaptive
Spike-Based Solver 1.0 libraries, which also provide the BLAS and LAPACK
libraries. Assuming that the package has been installed in a directory
called <SPIKE directory> and the user is compiling the source program
called hello_world.f90:
mpiifort hello_world.f90 -o hello_world.exe \
-I<SPIKE directory>/include \
-L<SPIKE directory>/lib/<arch> \
-lspike -lspike_mpi_comm \
-lspike_adapt -lspike_adapt_de -lspike_adapt_grid_f \
-lmkl_solver -lmkl_lapack -lmkl -lguide -lpthread
where mpiifort is the Fortran compiler driver for the Intel MPI Library
and <arch> is either 64, for IA-64 architecture, or em64t, for Intel® 64
architecture.
A run of the resulting executable hello_world.exe may look like
mpirun -np 4 hello_world.exe
and the following is the output of the run:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy EA3
TIME FOR PARTITIONING        2.074599266052246e-02
TIME FOR SPIKE BANDED FACT   2.813005447387695e-02
TIME FOR SPIKE BANDED SOLV   9.505033493041992e-03
TIME FOR SPIKE (FACT+SOLV)   3.763508796691895e-02

RESIDUAL                     3.885780586188048e-16
# Outside iterations: 0
SPIKE has succeeded (to reach the accuracy pspike%eps_out)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
INCLUDE 'spike.fi'
program hello_world_code
  use spike_module
  use mpi
  ! before the MPI_INIT calling sequences
  integer :: i, rank, nbprocs, code
  integer :: info
  type(spike_param) :: pspike  ! Spike parameter data structure
  type(matrix_data) :: mat     ! Spike matrix data structure
  double precision, dimension(:,:), allocatable :: f  ! rhs
  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nbprocs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
  ! set up Spike parameter data structure on all processors
  pspike%nbprocs = nbprocs; pspike%rank = rank
  call SPIKE_DEFAULT(pspike)  ! default values for pspike
  ! set up Spike matrix data parameters on all processors
  mat%format = 'D'; mat%astru = 'G'; mat%diagdo = 'Y'
  mat%n = 32; mat%kl = 1; mat%ku = 1
  ! create input matrix and rhs on Processor 0
  ! (LAPACK band storage: row 1 = superdiagonal, row 2 = diagonal,
  !  row 3 = subdiagonal)
  if (rank == 0) then
    allocate(mat%A(1:mat%kl+mat%ku+1, mat%n))
    allocate(f(1:mat%n, 1:1))
    mat%A(1,:) = -1.0d0; mat%A(2,:) = 6.0d0; mat%A(3,:) = -1.0d0
    f = 1.0d0
  end if
  ! one call to Spike for solving Ax=f
  call SPIKE(pspike, mat, f, info)
  ! solution is in f which resides in Processor 0
  if (info >= 0) then
    if (rank == 0) then
      do i = 1, mat%n
        print *, i, f(i,1)
      end do
    end if
  end if
  call MPI_FINALIZE(code)
end program hello_world_code
Figure 1.3: A very simple example
1 0.207106781186548
2 0.242640687119285
3 0.248737341529163
4 0.249783362055695
5 0.249962830805007
6 0.249993622774344
7 0.249998905841058
8 0.249999812272004
9 0.249999967790968
10 0.249999994473804
11 0.249999999051855
12 0.249999999837324
13 0.249999999972089
14 0.249999999995211
15 0.249999999999174
16 0.249999999999835
17 0.249999999999835
18 0.249999999999174
19 0.249999999995211
20 0.249999999972089
21 0.249999999837324
22 0.249999999051855
23 0.249999994473804
24 0.249999967790968
25 0.249999812272004
26 0.249998905841058
27 0.249993622774344
28 0.249962830805007
29 0.249783362055695
30 0.248737341529163
31 0.242640687119285
32 0.207106781186548
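The solution listed above can be cross-checked without the package. The following serial sketch solves the same 32-by-32 system with the classical Thomas algorithm for tridiagonal matrices. It is an independent check written for this guide's example only, not the method used by Spike Core, and the program and variable names are illustrative.

```fortran
program check_hello_world
  implicit none
  integer, parameter :: n = 32
  double precision :: sub(n), diag(n), sup(n), x(n)
  double precision :: w
  integer :: i

  ! Same system as the hello-world example: 6 on the diagonal,
  ! -1 on both off-diagonals, right-hand side of all ones.
  sub = -1.0d0; diag = 6.0d0; sup = -1.0d0
  x = 1.0d0   ! x holds the RHS on entry and the solution on exit

  ! Forward elimination (no pivoting; safe here because the matrix
  ! is strictly diagonally dominant).
  do i = 2, n
    w = sub(i) / diag(i-1)
    diag(i) = diag(i) - w * sup(i-1)
    x(i) = x(i) - w * x(i-1)
  end do

  ! Back substitution.
  x(n) = x(n) / diag(n)
  do i = n-1, 1, -1
    x(i) = (x(i) - sup(i) * x(i+1)) / diag(i)
  end do

  do i = 1, n
    print '(i3, f20.15)', i, x(i)
  end do
end program check_hello_world
```

The printed values should agree with the listing above to within rounding, including the symmetry between entries i and 33-i.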
1.3 Future Developments
Enhancements to Intel® Adaptive Spike-Based Solver 1.0 will be made in
several orthogonal areas: the kinds of sparse matrices handled via added
utility functions, the set of solution strategies it encompasses, and the
variety of parallel environments it supports.
When A is a general sparse matrix, reordering can often transform it
either into a banded matrix or into a low-rank perturbation of a banded
matrix. We intend to offer utilities for matrix reordering and
capabilities to handle more general sparse matrices.
In addition to the current LU-based strategies for handling the diagonal
blocks of the D matrix, we intend to add other strategies (e.g., based on
least squares) to handle very ill-conditioned systems. Other data
distribution strategies that exhibit better load-balancing properties
will also be added.
MPI is the only parallel environment supported currently, but alternative
parallel environments may be considered in future releases.
1.4 User Guide Outline

The remainder of this guide describes the usage of the Intel® Adaptive
Spike-Based Solver 1.0 in greater detail.
Chapter 2 focuses on invoking the package with a single function call to
obtain the solution X to the equation AX = F where both A and F are
stored in the MPI master process.
Chapter 3 describes how to solve AX = F using multiple function calls to
the package. The motivating example is the solution of multiple RHS,
$AX_k = F_k$, where the $F_k$ are available at different times. This way
the setup step related to A is done just once. We assume A and $F_k$ are
initially stored in the MPI master process.
Chapter 5 describes how the user can distribute A and F across multiple
MPI processes. This avoids the overhead of data distribution and allows
the solver to use the aggregate memory of a distributed-memory parallel
computer. The software package supports several distribution schemes
including ScaLAPACK's format. Thus, ScaLAPACK programs can be modified to
use Intel® Adaptive Spike-Based Solver with very little effort.
Chapter 6 presents a number of examples illustrating uses of the
software. Chapter 7 provides detailed reference material on the software
package directory structure and each of the provided functions.
Chapter 2
The Subroutine SPIKE
Intel® Adaptive Spike-Based Solver 1.0 contains two main components.
Spike Core is the component that implements the underlying numerical
methods, including for example the solution of the S system in
$A = \tilde{D}\tilde{S} + R$, factorization of the D system, and outer
iterations to deal with a non-zero R. The second component, Spike Adapt,
implements a strategy selection method based on information about the
underlying architecture, computer platform, and the linear system in
question. The single driver Spike conveniently integrates the
functionality offered by these two components and makes it available to
the user via a single call. In brief, this driver exercises the strategy
selection mechanism and then proceeds to solve AX = F for X given A and F
using the selected strategy. The user can find out which strategy was
chosen by examining several parameters in the program, or by running the
standalone binary executable spike_adapt.exe (at the command line) that
comes with the package. The user also has the option of selecting a
strategy manually by setting several parameters, but this requires more
detailed knowledge of how the strategies work. To this end, this chapter
also gives a brief guideline on choosing strategies, but defers to the
Appendix for a more mathematical description.
The single driver call is
call Spike(pspike, mat, f, info)
Related details are given in the rest of this chapter.
2.1 Setting the environment

Intel® Adaptive Spike-Based Solver provides scripts to automatically
initialize the user environment. They are located in the
<SPIKE directory>/tools/environment
directory, where <SPIKE directory> is the package's main directory after
installation. For example, it could be
/opt/intel/spike/1.0.
These scripts set environment variables that are needed to build and run
applications using the software package. Select the appropriate script
for the Linux shell and architecture. For example, to initialize the
package for the BASH shell on an Intel® EM64T system, execute the
following command:
> source spikevarsem64t.sh
To initialize the package for CSH on an Itanium® processor system, use
the following command:
> source spikevars64.csh
It is recommended that the initialization command be placed in the
appropriate shell startup file in $HOME: .cshrc or .bashrc for the CSH
and BASH shells, respectively.
2.2 Autoadapt
As illustrated in the hello_world program in Figure 1.3, the solver
parameters are held in a variable, pspike, of the derived type
spike_param. While the type spike_param has many components, only two
need to be set manually by the user:

Component  Type     Description
nbprocs    integer  number of processors (MPI related)
rank       integer  rank of the local processor (MPI related)

The rest of the components can be set to their defaults by calling the
routine Spike_Default. For example,
call Spike_Default(pspike)
sets those components of pspike to their default values, which are given
in Table 2.1. Note that some of these components are inout in nature,
which means that the subroutine may overwrite the input values as a
result of executing the software. The spike_param derived type also
contains a host of output components. Refer to Section 7.9 for
comprehensive information.
Component         Type     Default  Description
RSS               char     R        Reduced System Strategy:
                                    R, F, E, or T (see Section 2.4)
DFS               char     P        Diagonal Factorization Strategy:
                                    P, L, U, or A
OIS               integer  3        Outer Iteration Strategy:
                                    3 (more options in the future)

The three components above together specify a strategy for solving a
banded system using the Spike framework. When the autoadapt component
below is set to .true. (which is the default value), the input values of
these three components are ignored and overwritten to record the
automatically chosen strategy. Section 2.4 has more details on manual
strategy selection.

autoadapt         logical  .true.   strategy automatically selected if .true.
autoadapt_inputs  logical  .false.  only active when autoadapt is .true.;
                                    the user then needs to specify extra
                                    input components for the derived type
                                    matrix_data
BPS               integer  0        Banded Preconditioner Strategy:
                                     0  user does not specify a banded
                                        preconditioner
                                    -1  a banded preconditioner is specified
                                        by the user
threads           integer  1        value of the OpenMP environment variable
                                    OMP_NUM_THREADS if mat%format=S (i.e. #
                                    of threads for the PARDISO solver on
                                    each partition)
nbit_out          integer  50       max # of outer iterations
eps_out           double   10^-7    residual accuracy for outer iterations
nbit_in           integer  100      max # of inner iterations
eps_in            double   10^-7    residual accuracy for inner iterations
nzero             double   10^-9    new zero value ε for diagonal boosting:
                                    if |pivot| < ε‖A‖₁, the pivot is boosted
                                    by ε‖A‖₁
tp                integer  0        data distribution:
                                    0  data in Proc 0
                                    1  data on each Proc (cf. Chapter 5)
memfree           logical  .false.  deallocate memory for matrix
                                    (the case when tp=0)
residual          logical  .true.   compute the relative residual norm
timing            logical  .false.  provide timing information
comd              logical  .false.  provide detailed running information
file_output       integer  6        print information to screen if 6,
                                    file ID for spike.output otherwise

Table 2.1: List of input components for the derived type spike_param.
Note: RSS, DFS, OIS are inout whereas the rest are input only.

2.3 Data
This section explains how to set up the parameters within the variable
mat of type matrix_data used in the calling sequence example. The main
purpose of the type matrix_data is to hold the matrix in one of several
popular representations. Both the LAPACK banded storage format (without
additional storage for pivoting) and the CSR (Compressed Sparse Row)
format are supported. Depending on the value of pspike%tp in the variable
pspike passed to the routine, either the mat variable on MPI process-0
holds the full original matrix, or the mat variable on each of the MPI
processes holds part of the original matrix. In the former case, Spike
Core partitions the data held on process-0 and distributes them to the
other processes under the hood. In the latter case, the user needs to
manually put the appropriate part of the matrix on each of the MPI
processes; Chapter 5 gives the necessary details. For now, Table 2.2
gives details of the matrix_data structure relevant for pspike%tp=0, that
is, when the user puts the complete matrix into the mat variable on
process-0.
Finally, if the matrix data have been defined on process-0, the data for
the right-hand side (RHS) should also be defined on process-0 as
described in Table 2.3.
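As a sketch of the dense banded layout and of the two quantities defined in Table 2.2, the standalone program below packs a small tridiagonal matrix into LAPACK band storage (no extra pivoting rows) and evaluates the diagonal dominance and sparsity formulas. It does not call the package. The program and variable names are illustrative, and counting only band positions that lie inside the matrix as "elements in the band" is an assumption made here; the package may count them differently.

```fortran
program band_metrics
  implicit none
  integer, parameter :: n = 6, kl = 1, ku = 1
  double precision :: dense(n,n), ab(kl+ku+1,n)
  double precision :: offsum, vdiagdo, sparsity
  integer :: i, j, nnz, nband

  ! Test matrix: 6 on the diagonal, -1 on both off-diagonals.
  dense = 0.0d0
  do i = 1, n
    dense(i,i) = 6.0d0
    if (i > 1) dense(i,i-1) = -1.0d0
    if (i < n) dense(i,i+1) = -1.0d0
  end do

  ! LAPACK band storage: a(i,j) is stored in ab(ku+1+i-j, j)
  ! for max(1,j-ku) <= i <= min(n,j+kl).
  ab = 0.0d0
  nnz = 0
  nband = 0
  do j = 1, n
    do i = max(1, j-ku), min(n, j+kl)
      ab(ku+1+i-j, j) = dense(i,j)
      nband = nband + 1
      if (dense(i,j) /= 0.0d0) nnz = nnz + 1
    end do
  end do

  ! sparsity = (# non-zero elements) / (# elements in the band)
  sparsity = dble(nnz) / dble(nband)

  ! vdiagdo = min over rows i of |a_ii| / sum_{j /= i} |a_ij|
  vdiagdo = huge(1.0d0)
  do i = 1, n
    offsum = 0.0d0
    do j = max(1, i-kl), min(n, i+ku)
      if (j /= i) offsum = offsum + abs(ab(ku+1+i-j, j))
    end do
    if (offsum > 0.0d0) vdiagdo = min(vdiagdo, abs(ab(ku+1, i)) / offsum)
  end do

  print *, 'sparsity =', sparsity
  print *, 'vdiagdo  =', vdiagdo
end program band_metrics
```

For this matrix every band position is occupied, so the sparsity is 1, and the interior rows give 6/2, so vdiagdo is 3. A value above 1 indicates a strictly diagonally dominant matrix.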
2.4 Disabling Spike Adapt

While we recommend that the user set the autoadapt component to .true.,
it is possible to disable automatic strategy selection by setting the
autoadapt component to .false.. In this case, the strategy is defined by
the values of the three components (RSS, DFS, OIS), which are set to
(R,P,3) by Spike_Default. This section explains what these parameters
mean and offers a general guideline on how to set a strategy manually.
Recall that the computational framework of the software package is based
on the decomposition
$A = \tilde{D}\tilde{S} + R$
where the structures of D and S are depicted in Figure 1.2. The generic
way to solve the system AX = F can be described as:
Solve AX = F by a preconditioned iterative method
Use M as preconditioner where $M = \tilde{D}\tilde{S}$
The preconditioning step solves MZ = Y
The three components of a strategy are:

Reduced System Strategy RSS: The crux of a preconditioned iterative
scheme is the solution involving the preconditioner M. The key parallel
algorithm is the handling of the S matrix. The portion $S_{red}$ of S
near the partition boundaries constitutes a reduced system, and the key
to solving a system with S lies in the solution of this reduced system
$S_{red}$. There are several strategies for solving the reduced system:
R: stands for recursive. A recursive algorithm can be applied to the
reduced system.
F: stands for on the fly. The reduced system is solved using an
iterative method. In this situation, there is no need to compute the
$S_{red}$ matrix explicitly, as one only needs to compute the action of
$S_{red}$ on vectors. These products are computed on the fly based
mostly on the A matrix itself.
mat%      Type               Distribution    Description
format    char               (in) global     matrix format:
                                             D: Dense; S: Sparse CSR
astru     char               (in) global     matrix structure:
                                             G: General non-symmetric
diagdo    char               (inout) global  diagonal dominance:
                                             Y: Yes; N: No; I: Investigate
vdiagdo   double             (out) global    computed diagonal dominance value
                                             when mat%diagdo=I or
                                             pspike%autoadapt=.true.
sparsity  double             (out) global    computed degree of sparsity
                                             when pspike%autoadapt=.true.

If pspike%autoadapt and pspike%autoadapt_inputs are both .true., the
fields vdiagdo and sparsity become input and the user must specify these
values. Sparsity is simply the ratio of non-zero elements to the total
number of elements in the band, i.e. (# non-zero elements)/(# elements in
the band). Diagonal dominance is computed as follows:

$$
\mathrm{vdiagdo} = \min_i \left( \frac{|a_{ii}|}{\sum_{j=1,\, j \neq i}^{N} |a_{ij}|} \right)
$$

n         integer            (in) global     matrix dimension

The input fields below are for the case mat%format=D
kl        integer            (inout) global  # of subdiagonals in matrix
ku        integer            (inout) global  # of superdiagonals in matrix
A         double(bwd,mat%n)  rank 0          LAPACK banded matrix format,
                                             no extra pivoting space;
                                             bwd = mat%kl+mat%ku+1

The input fields below are associated with the sparse CSR format
(mat%format=S)
nbsa      integer            rank 0          # of non-zero matrix elements
sa        double(mat%nbsa)   rank 0          CSR format, matrix elements
jsa       integer(mat%nbsa)  rank 0          CSR format, column indices
isa       integer(mat%n+1)   rank 0          CSR format, start-of-row indices

Table 2.2: List of parameter fields of the type matrix_data variable mat.
Here all the matrix data are stored in process-0. If space for mat%A in
process-0 is allocated dynamically, the user may want to have it
deallocated automatically by setting pspike%memfree = .true.. All the
other parameter fields must be declared as global (i.e. common for each
MPI process).

Parameter  Type                 Distribution    Description
f          double(mat%n,nrhs)   (inout) rank 0  Right-hand side f (in)
                                                Solution x of Ax=f (out)
                                                nrhs stands for # of RHS

Table 2.3: Definition of the RHS (in) and solution (out) stored in rank 0.
E: stands for explicit. Here the $V_j$ and $W_j$ blocks of the S matrix
are explicitly computed. The reduced system is solved in an iterative
manner.
T: stands for truncated. This strategy exploits the special structure
of S. Should the top and bottom portions of suitable sizes of the $V_j$
and $W_j$ blocks be zero, solution of the reduced system $S_{red}$
becomes extremely easy. This strategy sets those blocks to zero
deliberately (hence truncating the $V_j$ and $W_j$ submatrices) and
trades the ease of solving this slightly wrong $S_{red}$ system for
corrective effort elsewhere.
Diagonal Factorization Strategy DFS: Solving DSZ = Y naturally involves,
in one form or another, solving systems with the D matrix, which is
block diagonal in structure. In the current version, Version 1.0, we
rely on various direct factorization algorithms to tackle this problem.
The strategies here correspond to factorizations of those diagonal
block matrices. Note, however, that while these strategies normally
correspond to familiar methods designed for dense matrices, they can
be overloaded to represent direct sparse matrix factorizations
motivated by the corresponding dense versions. For example, in the
case of sparse bands, L refers to the factorization provided by the
popular package PARDISO [11].
P: stands for pivoting. This is LU factorization with partial pivoting.
L: stands for LU. This is LU factorization without pivoting.
U: stands for UL. This computes both the LU and UL factorizations,
neither with pivoting.
A: stands for alternate. This alternates from block to block between
LU and UL factorizations, without pivoting.
Outer Iteration Strategy OIS: represents the iterative method used in
the outermost layer. An integer value selects a specific choice.
The current release, Version 1.0, supports only the BiCGStab iterative
scheme, which corresponds to the value 3.
While RSS and DFS are mostly orthogonal, they are not completely so.
Indeed, some factorization strategies are motivated by, and consequently
applicable only to, particular reduced system strategies. Therefore, not
all combinations of choices in RSS with DFS are supported or in fact
meaningful. In the current release, the following six combinations of
(RSS,DFS) are supported:
(T,U), (F,L), (R,L), (R,P), (T,A), (E,A).
pspike%tp pspike%nbprocs
1 2 2^n (n > 1) Even (≠ 2^n) Odd
TU FL
0 RL RP All All TU FL
EA TA
TU FL
1 None All TU FL TU FL
RL RP
Table 2.4: This table illustrates how the type of matrix partitioning and the
number of MPI processes affect the choice of (RSS,DFS) for the Spike Core
strategy. In future developments of Intel® Adaptive Spike-Based Solver, the
choice of (RSS,DFS) will be independent of the setting of the tp component.
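The list of supported pairs can be checked programmatically before a solver call. The following is a minimal sketch, not part of the package: the helper is_supported is hypothetical, and it only encodes the six pairs listed for Version 1.0 (it does not model the additional tp and nbprocs restrictions of Table 2.4).

```fortran
program check_strategy
  implicit none
  ! The six (RSS,DFS) pairs listed as supported in Version 1.0,
  ! stored as two-character strings "RSS//DFS".
  character(len=2), parameter :: supported(6) = &
       (/ 'TU', 'FL', 'RL', 'RP', 'TA', 'EA' /)
  character(len=1) :: rss, dfs

  rss = 'R'
  dfs = 'P'
  if (is_supported(rss, dfs)) then
     print *, '(', rss, ',', dfs, ') is a supported combination'
  else
     print *, '(', rss, ',', dfs, ') is not supported'
  end if

contains

  ! Hypothetical helper (an assumption, not a SPIKE routine):
  ! returns .true. if the pair (r,d) appears in the supported list.
  logical function is_supported(r, d)
    character(len=1), intent(in) :: r, d
    is_supported = any(supported == r // d)
  end function is_supported

end program check_strategy
```

A check of this kind could run right after the RSS and DFS components are set, so that an unsupported pair is reported before any factorization work starts.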
Moreover, if mat%format='D', the setting of the tp component of the
spike_param variable as well as the number of processors also affect the
applicability of these six choices. In this case, Table 2.4 lists the
applicable strategies under different tp and nbprocs settings. In the case
where mat%format='S', only the combination (F,L) is allowed while Spike
Adapt is turned off.
2.5 Running the spike_adapt.exe command

User applications do not call Spike Adapt directly. Rather, Spike Core calls
Spike Adapt if the autoadapt component of the spike_param structure is set
to true. Note that in this case the user-specified (RSS,DFS,OIS) values are
ignored and will in fact be overwritten. Nevertheless, a standalone
executable
spike_adapt.exe
is provided in the location
<SPIKE directory>/bin/<arch>
where arch is either 64, for IA-64 architecture, or em64t, for Intel® 64
architecture. Given a set of input characteristics (matrix size, bandwidth,
number of MPI processes, sparsity, diagonal dominance, number of right-hand
sides, type of matrix partitioning), this executable suggests an optimal
Spike Core strategy. Edit the Fortran NAMELIST file, ivars.nml, to specify
the matrix parameters, e.g.:
&IVAR matrix_size = 400000
      bandwidth = 161
      n_proc = 4
      sparsity = 0.9d0
      diagonal_dominance = 1.2d0
      n_rhs = 1
      tp = 0 /
Simply run spike_adapt.exe in the same directory as ivars.nml to get
a recommended Spike Core strategy, e.g.:
[cluster0]$ spike_adapt.exe
./spike_adapt.exe
Bandwidth = 161
Diagonal dominance = 1.20000000000000
Matrix size = 400000
Sparsity = 0.900000000000000
# RHS = 1
# Procs = 4
Type of partition: 0
The Spike_Adapt performance models selected fl3
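For readers unfamiliar with NAMELIST input, the sketch below shows how a file with the group and variable names used above can be read in standard Fortran. It is an illustration only and says nothing about how spike_adapt.exe itself is implemented. The file name ivars_sample.nml and the sanity checks are assumptions made for this example; the program writes its own sample file so that it does not touch an existing ivars.nml.

```fortran
program read_ivars
  implicit none
  integer :: matrix_size, bandwidth, n_proc, n_rhs, tp, ios
  double precision :: sparsity, diagonal_dominance
  namelist /IVAR/ matrix_size, bandwidth, n_proc, sparsity, &
                  diagonal_dominance, n_rhs, tp

  ! Write a sample file (hypothetical name) so the sketch is self-contained.
  open(unit=10, file='ivars_sample.nml', status='replace', action='write')
  write(10,'(a)') '&IVAR matrix_size = 400000'
  write(10,'(a)') '      bandwidth = 161'
  write(10,'(a)') '      n_proc = 4'
  write(10,'(a)') '      sparsity = 0.9d0'
  write(10,'(a)') '      diagonal_dominance = 1.2d0'
  write(10,'(a)') '      n_rhs = 1'
  write(10,'(a)') '      tp = 0 /'
  close(10)

  ! Read the namelist group back.
  open(unit=10, file='ivars_sample.nml', status='old', action='read')
  read(10, nml=IVAR, iostat=ios)
  close(10)
  if (ios /= 0) then
     print *, 'could not parse ivars_sample.nml, iostat =', ios
     stop
  end if

  ! Assumed sanity checks (not taken from the guide's implementation).
  if (sparsity <= 0.0d0 .or. sparsity > 1.0d0) &
       print *, 'warning: sparsity is a ratio and should lie in (0,1]'
  if (tp /= 0 .and. tp /= 1) &
       print *, 'warning: tp is expected to be 0 or 1'

  print *, 'Matrix size        =', matrix_size
  print *, 'Bandwidth          =', bandwidth
  print *, '# Procs            =', n_proc
  print *, 'Sparsity           =', sparsity
  print *, 'Diagonal dominance =', diagonal_dominance
  print *, '# RHS              =', n_rhs
  print *, 'Type of partition  =', tp
end program read_ivars
```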
Chapter 3
Separate calls
A single call to Spike
CALL Spike(pspike,mat,f,info)
can be split into a calling sequence of four separate operations:
CALL Spike_Begin(pspike,mat,pre,info)
CALL Spike_Preprocess(pspike,pre,info)
CALL Spike_Process(pspike,mat,pre,f,info)
CALL Spike_End(pspike,mat,pre,info)
where
Spike_Begin: beginning of the calling sequence;
Spike_Preprocess: preprocessing of the preconditioner data structure;
Spike_Process: processing of the matrix, preconditioner and the
right-hand side;
Spike_End: ending of the calling sequence.
In addition to pspike, mat, f and info, the split calls need a new
parameter, pre. This parameter is of type matrix_data and pertains to a
preconditioner. However, the user need not set any of its component values
(unless a user-defined preconditioner is supplied; see Chapter 4). Consider
it a work array that the software uses internally.
Splitting a single call to SPIKE is useful for applications that iterate
over changing right-hand sides while using the same original matrix. Such a
program invokes Spike_Process multiple times rather than invoking Spike
multiple times. Figure 3.1 presents a program solving two different
right-hand sides: (1, 0, 0, 0, 0, 0, 0, 0)T and then (0, 1, 0, 0, 0, 0, 0, 0)T.
Note that the program uses the global partitioning scheme, so the right-hand
sides are set up in node 0.
In the program, Spike_Begin, Spike_Preprocess and Spike_End are called once
while Spike_Process is called twice (once for each right-hand side). This
program is expected to run faster than an equivalent one with two Spike
calls because it initializes and frees the Spike data structures only once,
whereas a program calling Spike twice would duplicate this work.
! Declare variables used by SpikePACK
integer :: info
type(spike_param) :: pspike
type(matrix_data) :: mat, pre
double precision, dimension(8,1) :: f
...
! Set up pspike and mat as usual
...
! The following two calls are called once
call Spike_Begin(pspike, mat, pre, info)
call Spike_Preprocess(pspike, pre, info)
! Solve for the first right-hand side
if (rank == 0) then
  f = 0.0d0
  f(1,1) = 1.0d0
end if
! Spike_Process() is invoked for the first right-hand side
call Spike_Process(pspike, mat, pre, f, info)
! The solution of the first RHS is stored in f after Spike_Process().
...
! Solve for the second right-hand side; f is reset because it now
! holds the first solution
if (rank == 0) then
  f = 0.0d0
  f(2,1) = 1.0d0
end if
! Spike_Process() is invoked for the second right-hand side
call Spike_Process(pspike, mat, pre, f, info)
! The solution of the second RHS is stored in f after Spike_Process().
...
! The following call is called once
call Spike_End(pspike, mat, pre, info)
...
Figure 3.1: A program solving two right-hand sides using separate Spike calls.
Chapter 4
Banded Preconditioner
Intel® Adaptive Spike-Based Solver can be used as a framework for solving
banded systems that serve as effective preconditioners for general sparse
systems solved via iterative methods. In future releases, we will offer
different options for automatically deriving a robust banded preconditioner
from an arbitrary general sparse system. The component %BPS of the derived
type spike_param in Table 2.1 has been introduced for this purpose. In the
current version, Version 1.0, the component %BPS can take only two values:
0 (no preconditioner, the default value) or 1, where the banded
preconditioner has to be set by the user. Some users may take advantage of
this option when banded preconditioners can be constructed directly from the
application at hand, such as in nanoelectronics nanowire simulations [7].
Using the separate calling sequence presented in Chapter 3, one can decide
on a preconditioner pre that is factorized by the preprocessing step, while
the processing step uses the resulting factorization of the preconditioner
to accelerate the outer iterative scheme. Therefore, with the option %BPS=1
users can define their own banded preconditioner (either dense or sparse
within the band) for iteratively solving an original system whose matrix can
be general sparse. Depending on the data distribution format (component
%tp), the user must define the preconditioner pre, a variable of the derived
type matrix_data, in the same way as the original matrix mat, using either
Table 2.2 (%tp=0) or Table 5.1 (%tp=1). In Chapter 6, Example 6 illustrates
the use of the option %BPS=1.
Chapter 5
Manual Data Partition
It has been assumed until now that the matrix and RHS data initially reside
in the MPI master process (i.e., process 0). This is specified by setting
the spike_param tp parameter to zero. When the matrix and RHS are entirely
in process 0, our software package automatically distributes a portion of
the data to each MPI process before invoking the solver. The price paid for
this convenience is the overhead associated with the data distribution and
potential limits on the overall problem size. Specifically, the problem size
is limited by the memory available to process 0.
Alternatively, we allow the user to partition matrices and RHSs among the
MPI processes before calling Spike Core. This chapter describes the local
partitioning schemes supported by Intel® Adaptive Spike-Based Solver 1.0.
Let pspike and mat be the variables of type spike_param and matrix_data,
respectively, used during calls to Spike_Default, Spike, Spike_Begin,
Spike_Process, etc. The dense banded format is specified by
mat%format = 'D', while the sparse CSR format is specified by
mat%format = 'S'. In the following, pspike%tp is set to 1 to manually
distribute the matrix and RHS to the MPI processes.
5.1 Dense Banded Format

Consider a (complete) matrix of dimension n and bandwidth bwd, where
bwd = mat%kl + mat%ku + 1.
If the software were to distribute the data automatically (i.e., tp=0), one
would allocate a bwd-by-n array for mat%A. Table 5.1 gives details of the
matrix data structure relevant for pspike%tp=1, where the user manually
distributes the complete matrix into the local mat variable on each
processor. Figure 5.1 illustrates this partitioning scheme. The user must
split this bwd-by-n array into pspike%nbprocs arrays of dimension bwd-by-nj,
where the values of nj, satisfying
\[ n = \sum_{j=1}^{\mathrm{nbprocs}} n_j , \]
are set by the user. The values of nj are stored globally (i.e., common to
all processors) in the integer array mat%sizeA of dimension nbprocs, such
that mat%sizeA=(n1, n2, ..., n_nbprocs). The matrix elements are stored
locally on each processor in mat%A.
The RHSs are distributed by rows in a natural way. MPI process j-1 holds an
array of dimension nj-by-nRHS, for j = 1, 2, ..., nbprocs.
Figure 5.1: Illustration of a matrix in LAPACK banded storage format
distributed to four MPI processes (partitions 1 to 4).
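The column-wise split described in this section can be sketched in a few lines of standalone Fortran. The example below does not use the package: it builds the LAPACK band storage (no extra pivoting rows) of an 8-by-8 matrix with kl = ku = 2, diagonal 6 and off-diagonals -1, then cuts it into two blocks of columns according to a sizeA array. The variable names mirror the mat fields, but they are plain local variables chosen for illustration.

```fortran
program band_partition
  implicit none
  integer, parameter :: n = 8, kl = 2, ku = 2, nbprocs = 2
  integer, parameter :: bwd = kl + ku + 1
  integer :: sizeA(nbprocs)
  double precision :: A(bwd, n)
  double precision, allocatable :: Aloc(:,:)
  integer :: i, j, p, first

  sizeA = (/ 4, 4 /)          ! n1 + n2 must equal n

  ! LAPACK band storage: entry a(i,j) is stored in A(ku+1+i-j, j).
  A = 0.0d0
  do j = 1, n
     do i = max(1, j-ku), min(n, j+kl)
        if (i == j) then
           A(ku+1+i-j, j) = 6.0d0
        else
           A(ku+1+i-j, j) = -1.0d0
        end if
     end do
  end do

  ! Each partition receives a contiguous block of columns.
  first = 1
  do p = 1, nbprocs
     allocate(Aloc(bwd, sizeA(p)))
     Aloc = A(:, first:first+sizeA(p)-1)
     print '(a,i2,a,i2,a,i2)', 'partition', p, ': columns', first, &
          ' to', first+sizeA(p)-1
     do i = 1, bwd
        print '(8f6.1)', Aloc(i,:)
     end do
     deallocate(Aloc)
     first = first + sizeA(p)
  end do
end program band_partition
```

In an MPI program each rank would keep only its own block; here both blocks are printed from one process to keep the sketch runnable without MPI.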
5.2 Sparse CSR Format

Consider a (complete) sparse matrix of dimension n. If the software were to
distribute the data automatically (i.e., tp=0), one would use the CSR format
and allocate in processor 0 the set of arrays mat%sa, mat%jsa, mat%isa.
However, with tp=1, the user must distribute the complete sparse matrix by
blocks of rows into %nbprocs sets of arrays in CSR format, where the number
of non-zero elements of each submatrix, nnzj, and the number of rows nj,
satisfying
\[ n = \sum_{j=1}^{\mathrm{nbprocs}} n_j , \]
are set by the user. Figure 5.2 illustrates this partitioning scheme and
Table 5.1 gives details of the matrix data structure relevant for
pspike%tp=1.
Figure 5.2: Illustration of a matrix in CSR sparse storage format distributed
to four MPI processes.
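As an illustration of splitting a CSR matrix by blocks of rows, the standalone sketch below cuts a global CSR matrix (8 rows, 20 non-zeros, diagonal 6 and -1 on the second off-diagonals) into two row blocks and rebuilds the local row pointers. It is not taken from the package. In particular, keeping global column indices in the local jsa array is an assumption of this sketch; the guide does not state which numbering the solver expects.

```fortran
program csr_partition
  implicit none
  integer, parameter :: n = 8, nnz = 20, nbprocs = 2
  integer :: sizeA(nbprocs)
  double precision :: sa(nnz)
  integer :: jsa(nnz), isa(n+1)
  double precision, allocatable :: sa_loc(:)
  integer, allocatable :: jsa_loc(:), isa_loc(:)
  integer :: p, row0, nj, first, last, nbsa_loc, k

  sizeA = (/ 4, 4 /)
  ! Global CSR arrays (values, column indices, start-of-row indices).
  sa  = (/ 6,-1, 6,-1, -1,6,-1, -1,6,-1, -1,6,-1, -1,6,-1, -1,6, -1,6 /)
  jsa = (/ 1,3, 2,4, 1,3,5, 2,4,6, 3,5,7, 4,6,8, 5,7, 6,8 /)
  isa = (/ 1, 3, 5, 8, 11, 14, 17, 19, 21 /)

  row0 = 1                            ! first global row of the block
  do p = 1, nbprocs
     nj = sizeA(p)
     first = isa(row0)                ! first non-zero of the block
     last  = isa(row0+nj) - 1         ! last non-zero of the block
     nbsa_loc = last - first + 1      ! local count, nnz_j
     allocate(sa_loc(nbsa_loc), jsa_loc(nbsa_loc), isa_loc(nj+1))
     sa_loc  = sa(first:last)
     jsa_loc = jsa(first:last)        ! ASSUMPTION: global column numbers kept
     do k = 1, nj+1
        isa_loc(k) = isa(row0+k-1) - first + 1   ! shift pointers to start at 1
     end do
     print *, 'partition', p, ' nbsa =', nbsa_loc
     print *, '  isa =', isa_loc
     print *, '  jsa =', jsa_loc
     deallocate(sa_loc, jsa_loc, isa_loc)
     row0 = row0 + nj
  end do
end program csr_partition
```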
mat%      Type                               Distribution  Description
format    char (in)                          global        matrix format:
                                                           D: Dense; S: Sparse CSR
astru     char (in)                          global        matrix structure:
                                                           G: General non-symmetric
diagdo    char (inout)                       global        diagonal dominance:
                                                           Y: Yes; N: No; I: Investigate
vdiagdo   double (out)                       global        computed diagonal dominance
                                                           value when mat%diagdo=I or
                                                           pspike%autoadapt=.true.
sparsity  double (out)                       global        computed degree of sparsity
                                                           when pspike%autoadapt=.true.
If pspike%autoadapt and pspike%autoadapt inputs are both .true., the above
fields vdiagdo and sparsity become input and the user must specify these
values. Sparsity is simply the ratio of non-zero elements to the total
number of elements in the band, i.e. (# non-zero elements)/(# elements in
the band). Diagonal dominance is computed as follows:
\[ \mathrm{vdiagdo} = \min_i \left( \frac{|a_{ii}|}{\sum_{j=1,\, j \neq i}^{N} |a_{ij}|} \right) \]
n         integer (in)                       global        matrix dimension
sizeA     integer(pspike%nbprocs) (in)       global        set of partition dimensions, with
                                                           mat%sizeA=(n1, n2, ..., n_nbprocs)
The input fields below are for the dense banded case (mat%format=D)
kl        integer (inout)                    global        # of subdiagonals in matrix
ku        integer (inout)                    global        # of superdiagonals in matrix
A         double(bwd,mat%sizeA(i+1))         rank i        LAPACK banded matrix format,
                                                           no extra pivoting space;
                                                           bwd = mat%kl+mat%ku+1
The input fields below are associated with the sparse CSR format (mat%format=S)
nbsa      integer                            rank i        # of non-zero matrix elements
                                                           nnzj for partition j=i+1
sa        double(mat%nbsa)                   rank i        CSR format, matrix elements
jsa       integer(mat%nbsa)                  rank i        CSR format, column indices
isa       integer(mat%sizeA(i+1)+1)          rank i        CSR format, start-of-row indices
Table 5.1: List of parameter fields of the type matrix_data variable mat.
Here all the matrix data are distributed over the processors with
pspike%tp=1. All the other parameter fields must be declared as global
(i.e., common to every MPI process).
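The two quantities defined in the table, vdiagdo and sparsity, can be computed by hand for small cases. The standalone sketch below evaluates both for an 8-by-8 banded matrix with kl = ku = 2, diagonal 6 and off-diagonals -1. It follows the definitions given in the table and is not the package's own routine; all names are local to the example.

```fortran
program diag_dominance
  implicit none
  integer, parameter :: n = 8, kl = 2, ku = 2
  double precision :: a(n,n), offsum, vdiagdo, sparsity
  integer :: i, j, nnz, nband

  ! Build the test matrix: 6 on the diagonal, -1 elsewhere in the band.
  a = 0.0d0
  do i = 1, n
     do j = max(1, i-kl), min(n, i+ku)
        a(i,j) = -1.0d0
     end do
     a(i,i) = 6.0d0
  end do

  vdiagdo = huge(1.0d0)
  nnz = 0
  nband = 0
  do i = 1, n
     offsum = 0.0d0
     ! Entries outside the band are zero, so summing over the band suffices.
     do j = max(1, i-kl), min(n, i+ku)
        nband = nband + 1                      ! elements inside the band
        if (a(i,j) /= 0.0d0) nnz = nnz + 1     ! non-zero elements
        if (j /= i) offsum = offsum + abs(a(i,j))
     end do
     vdiagdo = min(vdiagdo, abs(a(i,i)) / offsum)
  end do
  sparsity = dble(nnz) / dble(nband)

  print *, 'vdiagdo  =', vdiagdo
  print *, 'sparsity =', sparsity
end program diag_dominance
```

For this matrix the interior rows have four off-diagonal entries of magnitude 1, so the minimum ratio is 6/4 = 1.5, and every element of the band is non-zero, so the sparsity ratio is 1.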
The values of nnzj are stored locally (i.e., on each processor) in the
integer mat%nbsa. The matrix elements are also stored locally on each
processor in the arrays mat%sa, mat%jsa, mat%isa with dimensions mat%nbsa,
mat%nbsa and nj + 1, respectively.
The RHSs are distributed by rows in a natural way. MPI process j-1 holds an
array of dimension nj-by-nRHS, for j = 1, 2, ..., nbprocs.
Chapter 6
Intel® Adaptive Spike-Based Solver Examples
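This section shows sample programs illustrating the calling sequences. In
examples 1, 2, 3 and 4, the package solves the following linear system of
size n = 8:
\[
\begin{pmatrix}
 6 & -1 & -1 &  0 &  0 &  0 &  0 &  0 \\
-1 &  6 & -1 & -1 &  0 &  0 &  0 &  0 \\
-1 & -1 &  6 & -1 & -1 &  0 &  0 &  0 \\
 0 & -1 & -1 &  6 & -1 & -1 &  0 &  0 \\
 0 &  0 & -1 & -1 &  6 & -1 & -1 &  0 \\
 0 &  0 &  0 & -1 & -1 &  6 & -1 & -1 \\
 0 &  0 &  0 &  0 & -1 & -1 &  6 & -1 \\
 0 &  0 &  0 &  0 &  0 & -1 & -1 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \\ f_5 \\ f_6 \\ f_7 \\ f_8 \end{pmatrix}
\]
Note that examples 1, 2, 3, and 5 can use 1, 2, or 4 MPI processes.
Example 4 is designed for only 2 MPI processes.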
6.1 Example 1: Automatic Partitioning

In this example, partitioning of the coefficient matrix and the RHS is done
by the software package. The RHS is (1, 1, 1, 1, 1, 1, 1, 1)T. This example
calls the subroutine SPIKE.
program example1
use spike_module
use mpi
implicit none
integer :: rank, code, nbprocs, i
double precision, dimension(:,:), allocatable :: f
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! SPIKE variables
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
integer :: info
type(spike_param) :: pspike
type(matrix_data) :: mat
!!!!!!!!!! MPI setup (standard MPI calls, restored so the program runs)
call MPI_INIT(code)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nbprocs, code)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! INPUT PARAMETER SPIKE
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
pspike%nbprocs = nbprocs
pspike%rank = rank
call SPIKE_DEFAULT(pspike)
!! changes from default
pspike%autoadapt = .false.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! INPUT PARAMETER MATRIX and RHS
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! All processors
mat%format = 'D'
mat%ASTRU = 'G'
mat%DIAGDO = 'Y'
mat%n = 8
mat%kl = 2
mat%ku = 2
!! Global matrix is defined only on processor 0
if (rank == 0) then
  allocate(mat%A(1:mat%kl+mat%ku+1, mat%n))
  mat%A(mat%ku+1,:) = 6.0d0   ! main diagonal
  mat%A(mat%ku-1,:) = -1.0d0  ! second superdiagonal
  mat%A(mat%ku,:) = -1.0d0    ! first superdiagonal
  mat%A(mat%ku+2,:) = -1.0d0  ! first subdiagonal
  mat%A(mat%ku+3,:) = -1.0d0  ! second subdiagonal
  !! RHS
  allocate(f(1:mat%n, 1:1))
  f = 1.0d0
end if
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! CALLING SPIKE
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
call SPIKE(pspike, mat, f, info)
if (info >= 0) then
  !!!!!! Global Solution
  if (rank == 0) then
    print *, 'Global solution'
    do i = 1, mat%n
      print *, i, f(i,1)
    end do
  end if
end if
call MPI_FINALIZE(code)
end program example1
We get the following output:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy RP3
TIME FOR SPIKE (FACT+SOLV) 1.050233840942383e-03
RESIDUAL 4.440892098500626e-16
SPIKE WARNING 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Global solution
1 0.297297297297297
2 0.360360360360360
3 0.423423423423423
4 0.441441441441441
5 0.441441441441441
6 0.423423423423423
7 0.360360360360360
8 0.297297297297297
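As a quick consistency check (added here, not produced by the package), the printed values agree with x = (33, 40, 47, 49, 49, 47, 40, 33)/111 to the digits shown. Assuming the off-diagonal entries of the system matrix are -1 and the diagonal entries are 6, substituting into the first three rows reproduces the right-hand side value 1:

```latex
\[
\begin{aligned}
\text{row 1:}\quad & \tfrac{1}{111}\,(6\cdot 33 - 40 - 47) = \tfrac{111}{111} = 1,\\
\text{row 2:}\quad & \tfrac{1}{111}\,(-33 + 6\cdot 40 - 47 - 49) = \tfrac{111}{111} = 1,\\
\text{row 3:}\quad & \tfrac{1}{111}\,(-33 - 40 + 6\cdot 47 - 49 - 49) = \tfrac{111}{111} = 1.
\end{aligned}
\]
```

The remaining rows follow from the symmetry of the matrix and of the solution vector.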
6.2 Example 2: Automatic Partitioning and Multiple RHS
In this example, two systems with the same coefficient matrix are solved.
The RHS are (1, 0, 0, 0, 0, 0, 0, 0)T and (0, 1, 0, 0, 0, 0, 0, 0)T. This
example calls the subroutine SPIKE.
program example2
use spike_module
use mpi
! ... other declarations and MPI setup as in example 1 ...
integer :: info
! ... pspike%nbprocs, pspike%rank and SPIKE_DEFAULT as in example 1 ...
!! changes from default
pspike%DFS = 'L'
!!!!!!!!!! INPUT PARAMETER MATRIX and RHS
mat%format = 'D'
mat%ASTRU = 'G'
mat%DIAGDO = 'Y'
mat%n = 8
mat%kl = 2
mat%ku = 2
if (rank == 0) then
  allocate(mat%A(1:mat%kl+mat%ku+1, mat%n))
  mat%A(mat%kl+1,:) = 6.0d0
  mat%A(mat%kl-1,:) = -1.0d0
  mat%A(mat%kl,:) = -1.0d0
  mat%A(mat%kl+2,:) = -1.0d0
  mat%A(mat%kl+3,:) = -1.0d0
  !! RHS: one column per right-hand side
  allocate(f(1:mat%n, 1:2))
  f = 0.0d0
  f(1,1) = 1.0d0
  f(2,2) = 1.0d0
end if
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! CALLING SPIKE
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
call SPIKE(pspike, mat, f, info)
if (info >= 0) then
  if (rank == 0) then
    do i = 1, mat%n
      print *, i, f(i,1), f(i,2)
    end do
  end if
end if
! ... MPI finalization as in example 1 ...
end program example2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy RL3

RESIDUAL 2.983724378680108e-16
SPIKE WARNING 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1 0.180112797913845 3.956012856536418E-002
2 3.956012856536417E-002 0.188791633817812
3 4.111665891770604E-002 4.856913757437318E-002
4 1.613131456063393E-002 4.462053676713363E-002
5 1.089571246639310E-002 1.844252629592944E-002
6 5.215387414340295E-003 1.191992291468731E-002
7 2.910913905678303E-003 5.545560519382510E-003
8 1.354383553336433E-003 2.910913905678303E-003
6.3 Example 3: Automatic Partitioning and Multiple RHS with Separate
Factorization and Solution
In this example, we again use two RHS, but this time the calling sequence
is separated into factorization and solve steps: the factorization is done
once and there are two solves, one for each RHS (1, 0, 0, 0, 0, 0, 0, 0)T
and (0, 1, 0, 0, 0, 0, 0, 0)T.
program example3
use spike_module
use mpi
! ... other declarations and MPI setup as in example 1, plus a
! ... matrix_data variable pre for the split calls ...
integer :: info
! ... pspike%nbprocs, pspike%rank and SPIKE_DEFAULT as in example 1 ...
!!!!!!!!!! INPUT PARAMETER MATRIX and RHS
mat%format = 'D' ! D for Dense Banded, S for Sparse banded, G for General Sparse
mat%ASTRU = 'G'  !!! General structure (non-symmetric)
mat%DIAGDO = 'Y'
mat%n = 8
mat%kl = 2
mat%ku = 2
if (rank == 0) then
  allocate(mat%A(1:mat%kl+mat%ku+1, mat%n))
  mat%A(mat%ku+1,:) = 6.0d0
  mat%A(mat%ku-1,:) = -1.0d0
  mat%A(mat%ku,:) = -1.0d0
  mat%A(mat%ku+2,:) = -1.0d0
  mat%A(mat%ku+3,:) = -1.0d0
  !! RHS
  allocate(f(1:mat%n, 1:1))
end if
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! CALLING SPIKE
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
call SPIKE_BEGIN(pspike, mat, pre, info)
if ((rank == 0) .and. (info < 0)) then
  print *, '1 Spike INFO exit / Error Code:', info, pspike%errorcode
end if
call SPIKE_PREPROCESS(pspike, pre, info)
if ((rank == 0) .and. (info < 0)) then
  ...
end if
!! First RHS
if (rank == 0) then
  f = 0.0d0
  f(1,1) = 1.0d0
end if
call SPIKE_PROCESS(pspike, mat, pre, f, info)
if (info >= 0) then
  if (rank == 0) then
    !!!!!! Global Solution 1
    print *, 'Global solution 1'
    do i = 1, mat%n
      print *, i, f(i,1)
    end do
  end if
end if
!! Second RHS (f is reset because it holds the first solution)
if (rank == 0) then
  f = 0.0d0
  f(2,1) = 1.0d0
end if
call SPIKE_PROCESS(pspike, mat, pre, f, info)
if (info >= 0) then
  if (rank == 0) then
    !!!!!! Global Solution 2
    print *, 'Global solution 2'
    do i = 1, mat%n
      print *, i, f(i,1)
    end do
  end if
end if
call SPIKE_END(pspike, mat, pre, info)
! ... MPI finalization as in example 1 ...
end program example3

Global solution 1
1 0.180112797913845
2 3.956012856536417E-002
3 4.111665891770604E-002
4 1.613131456063393E-002
5 1.089571246639310E-002
6 5.215387414340295E-003
7 2.910913905678303E-003
8 1.354383553336433E-003
Global solution 2
1 3.956012856536418E-002
2 0.188791633817812
3 4.856913757437318E-002
4 4.462053676713363E-002
5 1.844252629592944E-002
6 1.191992291468731E-002
7 5.545560519382510E-003
8 2.910913905678303E-003
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy RP3

RESIDUAL 2.359223927328458e-16
SPIKE WARNING 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
6.4 Example 4: Manual Partitioning

In this example, partitioning of the coefficient matrix and the RHS is done
manually on 2 processors. The RHS is (1, 1, 1, 1, 1, 1, 1, 1)T.
program example4
use spike_module
use mpi
! ... other declarations and MPI setup as in example 1 ...
integer :: info
! ... pspike%nbprocs, pspike%rank and SPIKE_DEFAULT as in example 1 ...
!! changes from default
pspike%tp = 1 !! customized local partitioning of type 1
!!!!!!!!!! INPUT PARAMETER MATRIX and RHS
mat%format = 'D' !! dense banded format
mat%ASTRU = 'G'
mat%DIAGDO = 'Y'
! global data
mat%n = 8
mat%kl = 2
mat%ku = 2
allocate(mat%sizeA(1:2)) !! only 2 partitions are considered
mat%sizeA(1) = 4
mat%sizeA(2) = 4
! local data for partition number rank+1
allocate(mat%A(1:mat%kl+mat%ku+1, mat%sizeA(rank+1)))
mat%A(mat%ku+1,:) = 6.0d0
mat%A(mat%ku-1,:) = -1.0d0
mat%A(mat%ku,:) = -1.0d0
mat%A(mat%ku+2,:) = -1.0d0
mat%A(mat%ku+3,:) = -1.0d0
!! RHS (local)
allocate(f(1:mat%sizeA(rank+1), 1:1))
f = 1.0d0
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! CALLING SPIKE
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
call SPIKE(pspike, mat, f, info)
if (info >= 0) then
  !!!!!! Local Solution
  print *, 'Local solution for partition', rank+1
  do i = 1, mat%sizeA(rank+1)
    print *, i, f(i,1)
  end do
endif
! ... MPI finalization as in example 1 ...
end program example4

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy RP3

RESIDUAL 4.440892098500626e-16
SPIKE WARNING 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Local solution for partition 1
1 0.297297297297297
2 0.360360360360360
3 0.423423423423423
4 0.441441441441441
Local solution for partition 2
1 0.441441441441441
2 0.423423423423423
3 0.360360360360360
4 0.297297297297297
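6.5 Example 5: Automatic Partitioning Using the CSR Input Format
The following system in compressed sparse row (CSR) format is solved using
the subroutine SPIKE.
\[
\begin{pmatrix}
 6 &  0 & -1 &  0 &  0 &  0 &  0 &  0 \\
 0 &  6 &  0 & -1 &  0 &  0 &  0 &  0 \\
-1 &  0 &  6 &  0 & -1 &  0 &  0 &  0 \\
 0 & -1 &  0 &  6 &  0 & -1 &  0 &  0 \\
 0 &  0 & -1 &  0 &  6 &  0 & -1 &  0 \\
 0 &  0 &  0 & -1 &  0 &  6 &  0 & -1 \\
 0 &  0 &  0 &  0 & -1 &  0 &  6 &  0 \\
 0 &  0 &  0 &  0 &  0 & -1 &  0 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
\]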
program example5
use spike_module
use mpi
! ... other declarations and MPI setup as in example 1 ...
integer :: info
! ... pspike%nbprocs, pspike%rank and SPIKE_DEFAULT as in example 1 ...
!! changes from default
pspike%RSS = 'F'
pspike%DFS = 'L'
!!!!!!!!!! INPUT PARAMETER MATRIX and RHS
mat%format = 'S' !! CSR
mat%ASTRU = 'G'
mat%DIAGDO = 'Y'
mat%n = 8
if (rank == 0) then
  mat%nbsa = 20 !! number of non-zero elements in CSR format
  allocate(mat%sa(1:mat%nbsa))  ! array for values
  allocate(mat%jsa(1:mat%nbsa)) ! array for column indexes
  allocate(mat%isa(1:mat%n+1))  ! array for row CSR indexes
  mat%sa = (/6,-1,6,-1,-1,6,-1,-1,6,-1,-1,6,-1,-1,6,-1,-1,6,-1,6/)
  mat%jsa = (/1,3,2,4,1,3,5,2,4,6,3,5,7,4,6,8,5,7,6,8/)
  mat%isa = (/1,3,5,8,11,14,17,19,21/)
  !! RHS
  allocate(f(1:mat%n, 1:1))
  f = 1.0d0
end if
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! CALLING SPIKE
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
call SPIKE(pspike, mat, f, info)
if (info >= 0) then
  if (rank == 0) then
    do i = 1, mat%n
      print *, i, f(i,1)
    end do
  end if
endif
! ... MPI finalization as in example 1 ...
end program example5

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy FL3

RESIDUAL 2.220446049250313e-16
SPIKE WARNING 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1 0.206896551724138
2 0.206896551724138
3 0.241379310344828
4 0.241379310344828
5 0.241379310344828
6 0.241379310344828
7 0.206896551724138
8 0.206896551724138
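The CSR arrays used in this example can be derived mechanically from the dense form of the matrix. The standalone sketch below, which does not use the package, builds the 8-by-8 matrix with 6 on the diagonal and -1 on the second sub- and superdiagonals (the -1 sign is the one consistent with the solution printed above) and scans it row by row to produce the values, column indices and start-of-row indices. The names sa, jsa, isa and nbsa mirror the mat fields but are local variables.

```fortran
program dense_to_csr
  implicit none
  integer, parameter :: n = 8
  double precision :: a(n,n)
  double precision, allocatable :: sa(:)
  integer, allocatable :: jsa(:)
  integer :: isa(n+1), i, j, k, nbsa

  ! Dense form: 6 on the diagonal, -1 two positions left and right of it.
  a = 0.0d0
  do i = 1, n
     a(i,i) = 6.0d0
     if (i-2 >= 1) a(i,i-2) = -1.0d0
     if (i+2 <= n) a(i,i+2) = -1.0d0
  end do

  nbsa = count(a /= 0.0d0)           ! number of non-zero elements
  allocate(sa(nbsa), jsa(nbsa))

  ! Row-by-row scan: isa(i) points at the first non-zero of row i.
  k = 0
  do i = 1, n
     isa(i) = k + 1
     do j = 1, n
        if (a(i,j) /= 0.0d0) then
           k = k + 1
           sa(k)  = a(i,j)
           jsa(k) = j
        end if
     end do
  end do
  isa(n+1) = nbsa + 1                ! one past the last non-zero

  print '(a,i3)',   'nbsa =', nbsa
  print '(a,9i3)',  'isa  =', isa
  print '(a,20i2)', 'jsa  =', jsa
end program dense_to_csr
```

The resulting nbsa, isa and jsa can be compared with the values assigned to mat%nbsa, mat%isa and mat%jsa in the example program.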
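6.6 Example 6: Automatic Partitioning Using the CSR Input Format with a
Preconditioner
Let us define the following general sparse system:
\[
\begin{pmatrix}
 6 &  0 &  0 &  0 &  0 &  0 &  0 & -1 \\
 0 &  6 &  0 &  0 &  0 &  0 & -1 &  0 \\
 0 &  0 &  6 &  0 &  0 & -1 &  0 &  0 \\
 0 &  0 &  0 &  6 & -1 &  0 &  0 &  0 \\
 0 &  0 &  0 & -1 &  6 &  0 &  0 &  0 \\
 0 &  0 & -1 &  0 &  0 &  6 &  0 &  0 \\
 0 & -1 &  0 &  0 &  0 &  0 &  6 &  0 \\
-1 &  0 &  0 &  0 &  0 &  0 &  0 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
\]
This linear system is solved iteratively with the following dense, banded
preconditioner:
\[
M =
\begin{pmatrix}
 6 & -1 &  0 &  0 &  0 &  0 &  0 &  0 \\
-1 &  6 & -1 &  0 &  0 &  0 &  0 &  0 \\
 0 & -1 &  6 & -1 &  0 &  0 &  0 &  0 \\
 0 &  0 & -1 &  6 & -1 &  0 &  0 &  0 \\
 0 &  0 &  0 & -1 &  6 & -1 &  0 &  0 \\
 0 &  0 &  0 &  0 & -1 &  6 & -1 &  0 \\
 0 &  0 &  0 &  0 &  0 & -1 &  6 & -1 \\
 0 &  0 &  0 &  0 &  0 &  0 & -1 &  6
\end{pmatrix}
\]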
program example6
use spike_module
use mpi
! ... other declarations and MPI setup as in example 1, plus a
! ... matrix_data variable pre for the preconditioner ...
integer :: info
! ... pspike%nbprocs, pspike%rank and SPIKE_DEFAULT as in example 1 ...
!! changes from default
pspike%DFS = 'L'
pspike%BPS = 1 ! a banded preconditioner is provided by the user
!!!!!!!!!! INPUT PARAMETER MATRIX and RHS
mat%format = 'S' !! CSR
mat%ASTRU = 'G'
mat%DIAGDO = 'Y'
mat%n = 8
if (rank == 0) then
  mat%nbsa = 16
  allocate(mat%sa(1:mat%nbsa))
  allocate(mat%jsa(1:mat%nbsa))
  allocate(mat%isa(1:mat%n+1))
  mat%sa = (/6,-1,6,-1,6,-1,6,-1,-1,6,-1,6,-1,6,-1,6/)
  mat%jsa = (/1,8,2,7,3,6,4,5,4,5,3,6,2,7,1,8/)
  mat%isa = (/1,3,5,7,9,11,13,15,17/)
  !! RHS
  allocate(f(1:mat%n, 1:1))
  f = 0.0d0
  f(1,1) = 1.0d0
end if
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! INPUT PARAMETER PRECONDITIONER
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
pre%format = 'D' ! Dense Banded format
pre%ASTRU = 'G'
pre%DIAGDO = 'Y'
pre%n = 8
pre%kl = 1
pre%ku = 1
if (rank == 0) then
  allocate(pre%A(1:pre%kl+pre%ku+1, pre%n))
  pre%A(pre%ku+1,:) = 6.0d0
  pre%A(pre%ku,:) = -1.0d0
  pre%A(pre%ku+2,:) = -1.0d0
end if
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! CALLING SPIKE
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
call SPIKE_BEGIN(pspike, mat, pre, info)
call SPIKE_PREPROCESS(pspike, pre, info)
call SPIKE_PROCESS(pspike, mat, pre, f, info)
call SPIKE_END(pspike, mat, pre, info)
if (info >= 0) then
  if (rank == 0) then
    do i = 1, mat%n
      print *, i, f(i,1)
    end do
  end if
endif
! ... MPI finalization as in example 1 ...
end program example6

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy RL3

RESIDUAL 6.194074031762243e-08
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1 0.171428570316091
2 6.395468064419293E-009
3 3.662243791124311E-009
4 2.602255083262622E-010
5 1.804453966738207E-009
6 9.713082754665346E-009
7 6.919736181372207E-010
8 2.857142191573501E-002
6.7 Toeplitz Matrix Example

This example solves a large Toeplitz system with RHS (1, 1, ..., 1, 1)T.
Source code is not shown for this example and can be found in
<SPIKE dir>/examples/examples_f90/source
The input matrix elements and properties must be defined in the file
<SPIKE dir>/examples/examples_f90/data/matrix_toeplitz.in
The following is a sample input file for a banded matrix (n = 48,000) with
30 on the main diagonal, 4 on the first upper and lower off-diagonals, 0.1
on the other off-diagonals, and upper and lower bandwidths of 80 (total
bandwidth is 161):
48000   !! n, size of the matrix
80      !! kl, Lower band
80      !! ku, Upper band
30.0d0  !! diagonal element
4.0d0   !! first lower off-diagonal element
4.0d0   !! first upper off-diagonal element
0.1d0   !! OTHERS off-diagonal element
1       !! s, number of RHS (THE value of the RHS are generated by the code)
Y       !! DIAGDO ? Y (Yes), N (No), I (Investigate)
Some of the components of the derived type spike_param variable can be
changed from their default values by modifying the input file
<SPIKE dir>/examples/examples_f90/data/spike_toeplitz.in
Here is a sample input file that selects the (R,L) strategy:

R        !! RSS ? E (Explicit), F (on the Fly), T (Truncated), R (Recursive)
L        !! DFS ? L (LU), U (LU, and UL), P (LU with pivoting)
3        !! OIS ? 0 (DIRECT), 2 (ITREFINEM), 3 (BiCGStab)
1D-7     !! eps out  !! ACCURACY BiCGStab OUTSIDE
50       !! nbit out !! NBRE MAX of ITERATIONS OUTSIDE
1D-5     !! eps in   !! ACCURACY BiCGStab INSIDE
30       !! nbit in  !! NBRE MAX of ITERATIONS INSIDE
1D-10    !! New zero machine for diagonal BOOSTing procedure
0        !! type of partitioning (0: global, 1: local)
.true.   !! timing
.true.   !! detailed information of the simulation
6        !! info printed on screen if =6, or on file spike.output if /=6
.false.  !! to enable spike adapt
Finally, one can run the Toeplitz example program with the command
mpirun -np 4 toeplitz
to get the following output:
SPIKE INFO
!! NBPROCESSORS ?                4
!! NBPARTITIONS ?                4
!! SPIKE ADAPT ?                 F
!! ALGORITHM ?                   R
!! FACTORIZATION ?               L
!! TYPE OF SOLVER ?              3
!! ACCURACY OUT. ?               1.000000000000000e-07
!! NB ITMAX OUT. ?               50
!! ACCURACY IN. ?                1.000000000000000e-05
!! NB ITMAX IN. ?                30
!! NEW ZERO PIVOT ?              1.000000000000000e-09
!! BOOST ?                       1.000000000000000e-10
!! Orign. Partition. ?           0
!! Size first last partition ?   12000
!! Size partition middle ?       12000
!! Free memory ?                 T
!! Compute Residual ?            T
!! ADD. MEMORY NEEDED (Mb)       1.056396789550781e+02
MATRIX INFO
!! MATRIX FORMAT ?               D
!! MATRIX STRUCT. ?              G
!! Diag. Dominant ?              Y
!! SIZE MATRIX ?                 48000
DENSE BANDED MATRIX
!! Lower band ?                  80
!! Upper band ?                  80
DETAILED TIME of PREPROCESS
TIME FACTLU (< to copy UL+FACT LU, if any)   1   5.732529163360596e-01
TIME FOR COMPUTING THE SPIKES                1   3.607890605926514e-01
> TIME FOR SPIKE PREPROCESSING                   9.365081787109375e-01
RHS INFO
!! Number of RHS ?               1
DETAILED TIME of PROCESS
TIME FOR MODIFIED RHS                            1.840589046478271e-01
TIME FOR REDUCED SYSTEM                          2.115011215209961e-03
TIME FOR RETRIEVE                                5.410194396972656e-03
RESIDUAL BEFORE OUTSIDE ITERATION
0   3.753941602013811e-15
TIME postprocess MATMUL                          0.000000000000000e+00
TIME postprocess SOLVE                           0.000000000000000e+00
> TIME FOR SPIKE PROCESSING                      2.131659984588623e-01
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy                RL3
TIME FOR SPIKE (FACT+SOLV)    1.149674177169800e+00
RESIDUAL                      3.753941602013811e-15
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
6.8 Sparse Banded Matrix Example

This example reads and solves a sparse banded matrix in CSR format. The
source code and a sample input matrix are provided in
<SPIKE dir>/examples/examples_f90/source/sparse.f90
The input matrix file is defined in
<SPIKE dir>/examples/examples_f90/data/matrix_sparse.in
and it contains the following fields:
csrfile  !! generic name of sparse format
I        !! DIAGDO? Y (Yes), N (No), I (Investigate)
.false.  !! sparse 2 dense banded (true or false)
The sparse system matrix is stored in four files whose generic name is
defined by the first line of the input file above (here, csrfile). The four
files, located in the same directory as the input file, are: csrfile.sa for
the matrix elements, csrfile.jsa for the column indices, csrfile.isa for the
start-of-row indices, and csrfile.sf for the right-hand-side elements. The
number of non-zero elements is given at the beginning of the first two
files, while the number of rows is given at the beginning of the last two.
In addition, the first line of csrfile.sf contains the number of right-hand
sides (if this number is greater than one, the elements should be stored
in multiple columns).
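To make the role of the three matrix arrays concrete, here is a minimal C sketch of a CSR matrix-vector product and of the bandwidths implied by a CSR pattern. The array names sa, jsa and isa simply mirror the file extensions above. Two points are assumptions of this sketch and are not stated in this guide: indices are taken to be 1-based, as is customary for Fortran data, and isa is taken to hold one extra trailing entry equal to nnz + 1. The functions are illustrative and are not part of the package.

```c
#include <assert.h>

/* y = A*x for an n-row matrix in CSR form.
   sa  : non-zero values, stored row by row
   jsa : column index of each value (1-based, assumed)
   isa : position in sa/jsa of the first entry of each row (1-based,
         assumed), with an extra entry isa[n] = nnz + 1 (assumed) */
void csr_matvec(int n, const double *sa, const int *jsa, const int *isa,
                const double *x, double *y)
{
    assert(n > 0 && sa && jsa && isa && x && y);
    for (int row = 0; row < n; ++row) {
        double sum = 0.0;
        for (int k = isa[row] - 1; k < isa[row + 1] - 1; ++k)
            sum += sa[k] * x[jsa[k] - 1];
        y[row] = sum;
    }
}

/* Lower (kl) and upper (ku) bandwidths implied by the CSR pattern. */
void csr_bandwidths(int n, const int *jsa, const int *isa, int *kl, int *ku)
{
    assert(n > 0 && jsa && isa && kl && ku);
    *kl = 0;
    *ku = 0;
    for (int row = 0; row < n; ++row) {
        for (int k = isa[row] - 1; k < isa[row + 1] - 1; ++k) {
            int d = (jsa[k] - 1) - row;   /* offset from the diagonal */
            if (d > *ku)  *ku = d;
            if (-d > *kl) *kl = -d;
        }
    }
}
```

For the 3 x 3 tridiagonal matrix with 2 on the diagonal and -1 next to it, the arrays would be sa = {2,-1,-1,2,-1,-1,2}, jsa = {1,2,1,2,3,2,3} and isa = {1,3,6,8}, and both bandwidths equal 1.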
As in the Toeplitz example, some of the components of the derived type
spike_param variable can be changed from their default values by
modifying the input file
<SPIKE dir>/examples/examples_f90/data/spike_sparse.in
In Intel® Adaptive Spike-Based Solver 1.0, only the (F,L) strategy is
allowed for solving sparse banded systems. However, the last field of the
file matrix_sparse.in activates a utility routine that transforms the CSR
input matrix into a dense banded matrix. The routine then sets the option
mat%format=D, which enables all the other strategies available for dense
banded systems.
Finally, one can run the sparse example program with the command
mpirun -np 4 sparse
to get the following output:

Matrix loaded
n= 960 nnz= 15844
SPIKE INFO
!! NBPROCESSORS ?                4
!! NBPARTITIONS ?                4
!! SPIKE ADAPT ?                 F
!! ALGORITHM ?                   F
!! FACTORIZATION ?               L
!! TYPE OF SOLVER ?              3
!! ACCURACY OUT. ?               1.000000000000000e-07
!! NB ITMAX OUT. ?               50
!! ACCURACY IN. ?                1.000000000000000e-05
!! NB ITMAX IN. ?                30
!! NEW ZERO PIVOT ?              1.000000000000000e-09
!! BOOST ?                       1.000000000000000e-10
!! Orign. Partition. ?           0
!! Size first last partition ?   240
!! Size partition middle ?       240
!! Free memory ?                 T
!! Compute Residual ?            T
!! ADD. MEMORY NEEDED (Mb)       1.129760742187500e-01
MATRIX INFO
!! MATRIX FORMAT ?               S
!! MATRIX STRUCT. ?              G
!! Diag. Dominant ?              N
!! Degree of Diag. Dominant ?    4.515916502082181e-01
!! Degree of Sparsity (within the band) ?   1.939978695008020e-01
!! SIZE MATRIX ?                 960
SPARSE BANDED MATRIX
!! Lower band ?                  43
!! Upper band ?                  43
!! # of nonzero el. ?            15844
DETAILED TIME of PREPROCESS
Pardiso Reorder                                  2.156949043273926e-01
Pardiso Factor                                   3.281998634338379e-02
TIME FACTLU (< to copy UL+FACT LU, if any)   1   2.488820552825928e-01
TIME FOR COMPUTING THE SPIKES                1   1.597404479980469e-05
> TIME FOR SPIKE PREPROCESSING                   2.490720748901367e-01
RHS INFO
!! Number of RHS ?               1
DETAILED TIME of PROCESS
RESIDUAL BEFORE BICGSTAB IN ITERATION
0.000000000000000e+00\t 1.0E0
 1   3.254900584017119e-01
 2   1.524937860974841e-01
 3   1.096310848650821e-01
 4   1.245483135972880e-01
 5   1.111453406959909e-01
 6   8.915746459549136e-02
 7   5.249862463043490e-02
 8   4.302595142216158e-02
 9   3.900289895927102e-02
10   3.926979240958679e-02
11   3.287579321744132e-02
12   2.444343554794345e-02
13   1.485004352536544e-02
14   1.238698875929856e-02
15   9.650195131724855e-03
16   8.849801060973625e-03
17   8.323358281183236e-03
18   8.079144531835643e-03
19   7.680279493893865e-03
20   7.487970038770650e-03
21   7.104139936547890e-03
22   6.671380160836741e-03
23   5.525140420971493e-03
24   4.153769366102168e-03
25   2.669502922237342e-03
26   5.848702681870502e-04
27   3.047193404835885e-04
28   4.992385101149940e-05
29   3.303640628851229e-05
30   5.581330306928776e-05
TIME postprocess MATMUL                          4.331183433532715e-02
TIME postprocess SOLVE                           1.573562622070312e-05
RESIDUAL BEFORE OUTSIDE ITERATION
0   6.713412612899256e-06
RESIDUAL BEFORE BICGSTAB IN ITERATION
0.000000000000000e+00\t 1.0E0
 1   7.518995708141478e-01
 2   1.329920813261047e+00
 3   9.715085308463867e-01
 4   5.822663192643874e-01
 5   4.991065236843633e-01
 6   9.183081200341480e-01
 7   6.737945975536668e-01
 8   7.242508132951437e-01
 9   9.315193578816668e-01
10   1.651555117692239e-01
11   1.513374964019616e-01
12   1.295965468556504e-01
13   9.354878739931842e-02
14   5.486209246886822e-02
15   3.848070819836115e-02
16   3.323710688587726e-02
17   1.104584875862269e-01
18   1.889083164889915e-03
19   4.994325553117251e-04
20   1.402865850536105e-04
21   9.767563925207885e-05
22   1.549147170819683e-04
23   2.623354861726824e-05
24   8.823943147011544e-06
5.000000000000000e-01   9.723836195944966e-11
> TIME FOR SPIKE PROCESSING                      1.574320793151855e-01
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Spike Strategy    FL3
RESIDUAL          9.723836195944966e-11
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
6.9 Calling Intel® Adaptive Spike-Based Solver from C Programs

Intel® Adaptive Spike-Based Solver can also be called from C programs.
The data structures of the C interface are declared in the header file:
<SPIKE dir>/include/spike.h
It is important to understand the difference between the Fortran and C
input formats. In the spike.h header file, the integer members comd,
autoadapt, failed, timing, memfree, residual, singular, blocked, boost,
and custom_pre of the spike_param_c_interface data structure correspond
to logical variables in Fortran. Therefore, to initialize these variables
in C, set them to -1 for .true. and 0 for .false..
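The -1/0 convention is easy to get wrong when C code uses 1 for true. One way to keep it in a single place is a pair of small helpers, sketched below. The helper and constant names are hypothetical and do not come from spike.h; treating any non-zero value as true when reading a flag back is also an assumption of this sketch.

```c
/* Values expected by the Fortran side for logical components
   (hypothetical constant names, not defined in spike.h). */
enum { FORTRAN_FALSE = 0, FORTRAN_TRUE = -1 };

/* Map a C truth value (zero or non-zero) to the -1 / 0 convention. */
int to_fortran_logical(int c_bool)
{
    return c_bool ? FORTRAN_TRUE : FORTRAN_FALSE;
}

/* Map a flag read back from the structure to a C truth value (0 or 1).
   Any non-zero value is treated as true (assumption). */
int from_fortran_logical(int f_logical)
{
    return f_logical != 0;
}
```

With these helpers, a flag such as the timing member could be set with pspike.timing = to_fortran_logical(1); instead of a literal -1.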
To call the package, add the following lines to the C code:
#include <mpi.h>
#include "spike.h"
.
.
.
/* Before the MPI_Init calling sequence */
int rank, nbprocs, code, info[4];
spike_param_c_interface pspike;
/* Data structure associated with a given SPIKE environment */
matrix_data_c_interface mat, pre;
/* Data structures associated with the original matrix mat
   and pre (if separate calling is used) */
/* Inside main function */
code = MPI_Init(&argc, &argv);
code = MPI_Comm_size(MPI_COMM_WORLD, &nbprocs);
code = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* After the MPI_Init calling sequences ......... */
pspike.nbprocs = nbprocs;
pspike.rank = rank;
spike_default(&pspike);  /* Default values for pspike */
.
.
/* CALL FOR SPIKE with DEFINITION of INPUT PARAMETERS */
.
.
/* End of main function */
The C versions of the Toeplitz example program and of examples 1-5
are available in the directory:
<SPIKE dir>/examples/examples_c
If necessary, modify both makefile and makefile.target to use the desired
compiler and MPI implementation. The examples use the Intel compilers
and MPI library by default. Moreover, the makefile should (i) link the
libspike.a library, (ii) link the BLAS and LAPACK libraries, and (iii)
specify the path to the spike.h header file.
Chapter 7
Reference guide
7.1 Intel® Adaptive Spike-Based Solver 1.0 directory structure
High-level Directory Structure
The table below shows the high-level directory structure after installation.
All directories are under the package's main directory, for example
/opt/intel/spike/1.0.
Directory              Comment
bin/64                 Itanium2® binary executable
bin/em64t              Intel64® binary executable
doc                    Documentation
examples/examples_c    C source code and data for examples
examples/examples_f90  Fortran 90 source code and data for examples
include                C headers, Fortran 90 module interfaces, and
                       MPI wrappers
lib/64                 Itanium2® static libraries
lib/em64t              Intel64® static libraries
spike_adapt/64         Spike Adapt data files, Itanium2®
spike_adapt/em64t      Spike Adapt data files, Intel64®
Detailed Directory Structure
The table below shows the detailed structure of the directories of Intel®
Adaptive Spike-Based Solver. Again, all directories are under the package's
main directory, for example, /opt/intel/spike/1.0.
Directory and Files    Contents

bin/64 Binaries directory, Itanium2r
ivars.nml Fortran NAMELIST file storing
the input characteristics used by
spike adapt.exe
spike adapt.exe Standalone executable to query
Spike Adapt
bin/em64t Binaries directory, Intel64r
ivars.nml Fortran NAMELIST file storing
the input characteristics used by
spike adapt.exe
spike adapt.exe Standalone executable to query
Spike Adapt
doc Documentation directory
Install.txt Installation Guide
spikeEULA.txt software license
spike ug.pdf The User Guide(PDF format)
spike ug.ps The User Guide(Postscript format)
examples/examples c C example source code and data
source Source code subdirectory
data Data files subdirectory
makefile[.target] Makefiles to build examples
examples/examples f90 Fortran 90 example source code and
data
source Source code subdirectory
data Data files subdirectory
makefile[.target] Makefiles to build examples
include Headers, Interfaces, wrappers
spike.fi Fortran 90 module interface
spike.h, spike c wrapper.h C headers
spike mpi comm.f90 source of MPI wrapper
lib/64 Itanium2r static libraries
libguide.a Intelr Legacy OpenMP run-time li-
brary for static linking
libguide.so Intelr Legacy OpenMP run-time li-
brary for dynamic linking
libmkl core.a Kernel library for IA-64 architecture
libmkl core.so Library dispatcher for dynamic load of
processor-specific kernel library
libmkl intel lp64.a LP64 interface library for Intel com-
piler
libmkl intel lp64.so LP64 interface library for Intel com-
piler
libmkl intel thread.a Parallel drivers library supporting In-
tel compiler
libmkl intel thread.so Parallel drivers library supporting In-
tel compiler
libmkl lapack.a Dummy library. Contains references
to Intel MKL libraries
libmkl.so Dummy library. Contains references
libmkl solver.a Dummy library. Contains references
libmkl solver lp64.a Sparse Solver, Interval Solver, and
GMP routines library supporting
LP64 interface
libspike.a Spike Core routines
libspike adapt.a Spike Adapt routines
libspike adapt de.so Spike Adapt routines, performance
model specific
libspike adapt grid f.a Spike Adapt routines, grid specific
libspike mpi comm.a Default MPI wrapper copied from
libspike mpi comm intelmpi.a.
User can build their own. See
Appendix C for detail.
libspike mpi comm intelmpi.a MPI wrapper supporting Intel MPI
Library for Linux
libspike mpi comm mpich1.a MPI wrapper supporting MPICH 1
libspike mpi comm openmpi.a MPI wrapper supporting Open MPI
lib/em64t Intel64r static libraries
libguide.a Intelr Legacy OpenMP run-time li-
brary for static linking
libguide.so Intelr Legacy OpenMP run-time li-
brary for dynamic linking
libmkl core.a Kernel library for
Intel64r architecture
libmkl core.so Library dispatcher for dynamic load of
processor-specific kernel library
libmkl intel lp64.a LP64 interface library for Intel com-
piler
libmkl intel lp64.so LP64 interface library for Intel com-
piler
libmkl intel thread.a Parallel drivers library supporting In-
tel compiler
libmkl intel thread.so Parallel drivers library supporting In-
tel compiler
libmkl lapack.a Dummy library. Contains references
libmkl.so Dummy library. Contains references
libmkl solver.a Dummy library. Contains references
libmkl solver lp64.a Sparse Solver, Interval Solver, and
GMP routines library supporting
LP64 interface
libspike.a Spike Core routines
libspike adapt.a Spike Adapt routines
libspike adapt de.so Spike Adapt routines, performance
model specific
libspike adapt grid f.a Spike Adapt routines, grid specific
libspike mpi comm.a Default MPI wrapper copied from
libspike mpi comm intelmpi.a.
User can build their own. See
Appendix C for detail.
libspike mpi comm intelmpi.a MPI wrapper supporting Intel MPI
Library for Linux
libspike mpi comm openmpi.a MPI wrapper supporting Open MPI
spike adapt/64 Itanium2r Spike Adapt data files
de Subdirectory, calibration data files
spike adapt/em64t Intel64r Spike Adapt data files
de Subdirectory, calibration data files
tools/environment Initialization shell scripts
spikevars64.csh Itanium2r platforms; C shell
spikevars64.sh Itanium2r platforms; Bourne shell
spikevarsem64t.csh Intel64r platforms; C shell
spikevarsem64t.sh Intel64r platforms; Bourne shell
Table 7.1: Detailed package directory structure
7.2 Intel® Adaptive Spike-Based Solver and ScaLAPACK
This section is addressed to ScaLAPACK users who would like to experiment
with Intel® Adaptive Spike-Based Solver while making only minor changes to
their code for solving dense banded linear systems (data in double precision).
We describe a practical way to insert Intel® Adaptive Spike-Based Solver
calling sequences in place of ScaLAPACK ones.
The ScaLAPACK calling sequences concerned by this migration procedure are:
For non-diagonally dominant systems
    PDGBSV: Single calling sequence, Factorization+Solve
    PDGBTRF, PDGBTRS: Separated calling sequences, Factorization and Solve
For diagonally dominant systems
    PDDBSV: Single calling sequence, Factorization+Solve
    PDDBTRF, PDDBTRS: Separated calling sequences, Factorization and Solve
As described in this documentation, our software package can also handle
single or separated calling sequences. In contrast to ScaLAPACK, the
diagonally dominant property does not require different calling sequences;
it is set in the matrix_data data structure through the parameter
mat%diagdo.
Let us consider the following ScaLAPACK code:
Call PDGBSV(N, BWL, BWU, NRHS, A, JA, DESCA, IPIV, B, IB,
            DESCB, WORK, LWORK, INFO)
where we assume the user is familiar with all the above parameters
(as described in the ScaLAPACK user guide [3]). This calling sequence can
be replaced by the following one:
Call Spike(pspike, mat, B, info_spike)
where the parameters pspike, mat, and info_spike need to be declared at
the beginning of the program as described in this documentation, while
the parameter B, which contains the RHS and the solution, is identical to
the ScaLAPACK one. Before the call to Spike, the other parameters need to
be set as follows:
pspike%rank=rank        ! with rank the user variable name for processor rank
pspike%nbprocs=nbprocs  ! with nbprocs the user variable name
                        ! for # of processors
call Spike_Default(pspike)
pspike%tp=1  ! data local distribution of type 1 is compatible with ScaLAPACK
! if the user wants to turn off spike adapt by pspike%autoadapt=.false.
! the user can select here his own Spike Core strategy (RSS, DFS, OIS)
mat%format='D'  ! dense banded format (double precision data)
mat%astru='G'   ! general nonsymmetric
mat%n=N         ! N as in ScaLAPACK
mat%kl=BWL      ! BWL as in ScaLAPACK
mat%ku=BWU      ! BWU as in ScaLAPACK
mat%diagdo='N'  ! 'N' if ScaLAPACK command starts with PDGB..
                ! 'Y' if ScaLAPACK command starts with PDDB..
mat%Aj=AA  ! AA is the matrix A in ScaLAPACK without extra space for pivoting
           ! if mat%diagdo='Y', AA is identical to A and one can simply
           ! use mat%Aj=>A (with attribute target for A)
           ! if mat%diagdo='N', the user may first want to suppress the extra
           ! storage space in the allocation of A and then
           ! use mat%Aj=A
allocate(mat%sizeA(1:nbprocs))
mat%sizeA(1:nbprocs-1)=DESCA(4)  ! ScaLAPACK variable:
                                 ! size of the local partition
mat%sizeA(nbprocs)=n-(nbprocs-1)*mat%sizeA(1)  ! size of the last partition
In the case of separated calling sequences, the setup of the above
parameters is identical. Also, the BLACS commands introduced for ScaLAPACK
are unnecessary, as our package is independent of the BLACS library.
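The partition sizes set at the end of the listing follow a simple rule: every partition except the last takes the ScaLAPACK block size DESCA(4), and the last one takes whatever rows remain. The C sketch below reproduces that arithmetic so it can be checked before calling the solver. The function name and the error convention are assumptions made for this example; they are not part of the package.

```c
#include <assert.h>

/* Fill size_a[0..nbprocs-1] with the partition sizes:
   the first nbprocs-1 partitions hold block_size rows each and the
   last partition holds the remaining n - (nbprocs-1)*block_size rows.
   Returns 0 on success, -1 if the last partition would be empty or
   negative. Names are illustrative, not SPIKE identifiers. */
int partition_sizes(int n, int nbprocs, int block_size, int *size_a)
{
    assert(n > 0 && nbprocs > 0 && block_size > 0 && size_a);
    long last = (long)n - (long)(nbprocs - 1) * block_size;
    if (last <= 0)
        return -1;
    for (int p = 0; p < nbprocs - 1; ++p)
        size_a[p] = block_size;
    size_a[nbprocs - 1] = (int)last;
    return 0;
}
```

For n = 48000, four processes and a block size of 12000, every partition has 12000 rows, which matches the partition sizes printed in the Toeplitz example output of Section 6.7.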
7.3 Spike_Default
Sets the default values of all applicable components of the type
spike_param variable.
Syntax
CALL Spike_Default(pspike)
Description
The routine assigns default values to those input and inout components of
the type spike_param variable pspike that have a default. Other components
remain unchanged.
Input Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure of
          type spike_param described in Section 2.2.
Output Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2. On exit, the components of
          pspike tabulated in Table 2.1 are assigned the
          default values specified there.
7.4 Spike
The Spike solver driver solves the complete system in one call.
Syntax
CALL Spike(pspike,mat,f,info)
Description
The routine solves the system specified by the matrix contained in mat with
the right-hand side(s) contained in f.
Input Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure of type
          spike_param described in Section 2.2.
mat       Matrix data structure of type matrix_data described
          in Section 2.3 and Chapter 5.
f         Double precision array containing the right-hand
          side(s). Depending on the value of pspike%tp, f may
          be global on rank 0 or locally distributed on each
          processor.
Output Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2.
f         The computed solution of the system.
info      Returns the error code. If info=0, the execution is
          successful. If info≠0, the package encountered a
          problem and has stopped unexpectedly; the detailed
          description of the error codes is presented in
          Section 7.11.
7.5 Spike_Begin
Begins the calling sequence.
Syntax
CALL Spike_Begin(pspike,mat,pre,info)
Description
The routine partitions the matrix and allocates a work table. Moreover,
Spike Adapt may be invoked in this routine.
Input Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure of
          type spike_param described in Section 2.2. On entry,
          if pspike%autoadapt is .true., Spike Adapt will be
          invoked to select a solver strategy.
mat       Matrix data structure of type matrix_data described
          in Section 2.3 and Chapter 5.
pre       Preconditioner data structure of type matrix_data.
          The use of a banded preconditioner is described in
          Chapter 4.
Output Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2. On exit, if Spike Adapt was
          invoked, pspike%DFS, pspike%RSS and pspike%OIS will
          be updated.
mat       Matrix data structure described in Section 2.3. If the
          matrix is defined with global data as input, on exit
          mat will contain the local partitioning of the matrix
          on each processor (the memory of the global matrix
          on rank 0 is deallocated if pspike%memfree is set to
          .true.).
pre       Contents set by Spike_Begin. It contains the local
          partitioning of the preconditioner (it may just
          be a copy of the matrix) that will be used in
          Spike_Preprocess.
info      Returns the error code. If info=0, the execution is
          successful. If info≠0, the package encountered a
          problem and has stopped unexpectedly; the detailed
          description of the error codes is presented in
          Section 7.11.
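7.6 Spike_Preprocess
Preprocesses the preconditioner data.
Syntax
CALL Spike_Preprocess(pspike,pre,info)
Description
The routine factorizes the preconditioner pre using the strategy specified
in pspike. Note that pre can be either an explicit preconditioner supplied
by the user or simply a copy (made automatically by the software
package) of the original system.
Input Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2.
pre       The output of a previous Spike_Begin call.
Output Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2.
pre       Contents modified; it contains the factorization of the
          preconditioner, ready to be used in Spike_Process
          any number of times.
info      Returns the error code. If info=0, the execution is
          successful. If info≠0, the package encountered a
          problem and has stopped unexpectedly; the detailed
          description of the error codes is presented in
          Section 7.11.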
7.7 Spike_Process
Processes the matrix, the preconditioner and the right-hand side.
Syntax
CALL Spike_Process(pspike,mat,pre,f,info)
Description
The routine solves the reduced system and then retrieves the overall
solution. In this version of Intel® Adaptive Spike-Based Solver, the solver
includes outer iterations. The preconditioner is defined by pre, and the
original matrix is defined by mat. The routine Spike_Process can be called
repeatedly for applications that involve iterations with changing right-hand
sides f but with the same original matrix of coefficients.
Input Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2.
mat       Matrix data structure. On entry, the matrix
          data should have been processed by a previous
          Spike_Begin call, so that the data have been distributed
          to all processors.
pre       Set up by a previous call to Spike_Preprocess.
f         On entry, f stores the right-hand side. Depending on
          the value of pspike%tp, f may be global on rank 0 or
          locally distributed on each processor.
Output Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2.
f         On exit, f stores the solution of the system. Depending
          on the value of pspike%tp, f may be global on rank
          0 or locally distributed on each processor.
info      Returns the error code. If info=0, the execution is
          successful. If info≠0, the package encountered a
          problem and has stopped unexpectedly; the detailed
          description of the error codes is presented in
          Section 7.11.
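7.8 Spike_End
Ends the calling sequence.
Syntax
CALL Spike_End(pspike,mat,pre,info)
Description
The routine clears the memory space, deallocating all local partitioning for
mat and pre.
Input Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2.
mat       Matrix data structure described in Section 2.3.
pre       Preconditioner data structure.
Output Parameters
pspike    Intel® Adaptive Spike-Based Solver data structure
          described in Section 2.2.
mat       Matrix data structure described in Section 2.3. On
          exit, several components of mat are deallocated.
pre       On exit, pre is deallocated.
info      Returns the error code. If info=0, the execution is
          successful. If info≠0, the package encountered a
          problem and has stopped unexpectedly; the detailed
          description of the error codes is presented in
          Section 7.11.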
7.9 spike_param details
The type spike_param has a number of input components whose possible
default values are listed in Table 2.1. Furthermore, this type has a number
of output components, which are listed in Table 7.2.
Component           Type (Intent)  Distribution  Description
boost               logical (out)  local         Returns .true. if a zero-pivot is detected
                                                 |pivot| > 0 ||.||
nb_boost            integer (out)  global        # of boosts performed
nbit_out0           integer (out)  global        # of outer iterations
nbit_in0            integer (out)  global        # of inner iterations
memory              double (out)   global        Total amount of memory (in Mb) needed by
                                                 Spike Core
maxres              double (out)   global        If the component residual is set to .true.,
                                                 returns the maximum relative residual
                                                 for all RHS
failed              logical (out)  global        Returns .true. if Spike Core fails to reach
                                                 the accuracy specified in the eps_out
                                                 component
error_code          integer (out)  global        If info≠0 in the calling sequences,
                                                 returns the error code as presented in
                                                 Section 7.11
Below are the output components for timing information, available if the
timing component is set to .true.
tspike_adapt        double (out)   global        Time spent in Spike Adapt
tspike_preparation  double (out)   global        Preparation time (with Spike Adapt)
tspike_prep         double (out)   global        Preprocessing time
tspike_process      double (out)   global        Processing time
tspike_residual     double (out)   global        Time spent to compute the residual
Table 7.2: List of output components for the derived type spike_param. A
variable of this type can be local on each partition or global (i.e. common
to all partitions).
7.10 matrix_data details
The derived type matrix_data is used for the storage of matrices. In Intel®
Adaptive Spike-Based Solver 1.0, it is used exclusively for the matrix
representing the linear system. In the future, the user will be able to use
this type to explicitly store a separate matrix used as a preconditioner for
the linear system. The components of this type and their meaning are given
in Chapter 5.
7.11 info details
Errors and warnings encountered during a run of Intel® Adaptive Spike-Based
Solver are stored in an integer variable, info. All MPI, LAPACK
and PARDISO errors are fatal; in other words, execution of the program is
terminated if such an error is encountered. Other possible sources of
warnings and errors are Spike Core and Spike Adapt. If the output info
parameter is not zero, either an error (info<0) or a warning (info>0) was
encountered. The possible return values for the info parameter are given in
Table 7.3.
info  Classification  Description
  3   Warning         Spike Adapt could not make a prediction
  2   Warning         A zero-pivot has been detected; OIS has
                      been set to 3 due to boosting
  1   Warning         This matrix (or preconditioner, if any) is not
                      narrow banded; this will affect the Spike
                      performance
  0                   Successful exit
 -1   Error           Spike Core error
 -2   Error           Spike Adapt error
 -3   Error           MPI error
 -4   Error           LAPACK error
 -5   Error           PARDISO error
Table 7.3: Return code descriptions for the parameter info
If info<0, the user can determine whether Spike Core, Spike Adapt,
MPI, LAPACK, or PARDISO is responsible for the unexpected termination.
The corresponding error code is stored in the component pspike%error_code.
Please refer to Table 7.4 for the possible values of pspike%error_code if a
fatal error is encountered in Spike Core (info=-1), and similarly refer to
Table 7.5 if a fatal error is encountered in Spike Adapt (info=-2). When info
equals -3, -4, or -5, the error code is also stored in pspike%error_code, and
the user should consult the MPI, LAPACK, or PARDISO documentation,
respectively.
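A caller typically wants to turn the info value into a readable message before looking up the detailed error code. The following C sketch encodes only the sign convention and the component mapping of Table 7.3; the function names are hypothetical and are not provided by the package.

```c
/* Classification of the info return value, following Table 7.3:
   zero is success, positive values are warnings, negative are errors. */
const char *spike_info_class(int info)
{
    if (info == 0)
        return "success";
    return (info > 0) ? "warning" : "error";
}

/* Component responsible for a negative info value (Table 7.3).
   Returns "unknown" for any value outside the range -5..-1. */
const char *spike_info_source(int info)
{
    switch (info) {
    case -1: return "Spike Core";
    case -2: return "Spike Adapt";
    case -3: return "MPI";
    case -4: return "LAPACK";
    case -5: return "PARDISO";
    default: return "unknown";
    }
}
```

After a call that returns info=-2, for instance, spike_info_source reports "Spike Adapt", which indicates that the detailed code stored in the error_code component should be looked up in Table 7.5.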
info=-1  Description
    0    Successful exit - Default value
 -200    memory allocation error
 -201    rho = 0, BiCGStab(out) failed
 -202    omega = 0, BiCGStab(out) failed
 -303    cannot select Spike Adapt if you want to use your own
         preconditioner (%BPS=1)
 -304    the format of the preconditioner is incorrect, it should be
         pre%format=D or S
 -305    the preconditioner should be banded
 -306    the preconditioner should be the same size as the matrix
 -307    with a preconditioner (option %BPS=1), one needs to use
         iterative methods (%OIS≠0)
 -308    the preconditioner cannot be used with DFS=P
 -309    either upper or lower bandwidth is too small for the size
         of the partitions
 -310    number of processors has to be even for RSS=A or P
 -313    the size of the matrix mat%n must be > 1
 -314    mat%kl and mat%ku must be ≥ 1
 -315    the format of the matrix is incorrect, it should be
         mat%format=D or S
 -320    Spike Adapt cannot be selected if only one processor
 -399    wrong value for %tp
 -400    combinations (DFS, RSS) not supported by Version 1.0
 -401    DFS=L or P are the only possible options if one processor
         is used
 -402    DFS=A cannot be used here, see Table 2.4
 -405    RSS=R cannot be used here, see Table 2.4
 -407    only tp=0 can handle a one-processor run
Table 7.4: Return code descriptions for %error_code
info=-2  Classification  Description
    1    Information     Spike Core strategy selected by grid lookup
    2    Information     Spike Core strategy selected by performance
                         models
    3    Warning         Spike Core strategy selected arbitrarily
 -310    Error           pspike%tp=2 requires an even number of MPI
                         processes
 -312    Error           pspike%tp=2 requires RSS=A
 -313    Error           pspike%tp=1 cannot be used when RSS=A
 -402    Error           Memory allocation failed during model
                         evaluation
 -403    Error           SPIKE_ADAPT_DATA environment variable not
                         set
 -404    Error           Error reading directory specified by
                         SPIKE_ADAPT_DATA environment variable
 -405    Error           Performance models not found in directory
                         specified by SPIKE_ADAPT_DATA environment
                         variable
 -406    Error           Could not open performance models
 -407    Error           Could not read performance models
Table 7.5: Descriptions of the Spike Adapt return codes for %error_code.
Bibliography
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz,
A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK:
A portable linear algebra library for high-performance computers.
Technical report, Knoxville, 1990.
[2] Michael W. Berry and Ahmed Sameh. Multiprocessor schemes for solving
block tridiagonal linear systems. The International Journal of
Supercomputer Applications, 1(3):37-57, 1988.
[3] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel,
I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley,
D. Walker, and R. C. Whaley. ScaLAPACK: a linear algebra library
for message-passing computers. In Proceedings of the Eighth SIAM
Conference on Parallel Processing for Scientific Computing (Minneapolis,
MN, 1997), page 15 (electronic), Philadelphia, PA, USA, 1997. Society
for Industrial and Applied Mathematics.
[4] S. C. Chen, D. J. Kuck, and A. H. Sameh. Practical parallel band
triangular system solvers. ACM Transactions on Mathematical Software,
4(3):270-277, 1978.
[5] Jack J. Dongarra and Ahmed H. Sameh. On some parallel banded
system solvers. Parallel Computing, 1(3):223-235, 1984.
[6] D. H. Lawrie and A. H. Sameh. The computation and communication
complexity of a parallel banded system solver. ACM Trans. Math.
Softw., 10(2):185-195, 1984.
[7] E. Polizzi and A. Sameh. Numerical parallel algorithms for large-scale
nanoelectronics simulations using NESSIE. Journal of Computational
Electronics, 3(3-4):363-366, 2005.
[8] Eric Polizzi and Ahmed H. Sameh. A parallel hybrid banded system
solver: the SPIKE algorithm. Parallel Comput., 32(2):177-194, 2006.
[9] Eric Polizzi and Ahmed H. Sameh. SPIKE: A parallel environment for
solving banded linear systems. Computers & Fluids, 36(1):113-120,
2007.
[10] A. H. Sameh and D. J. Kuck. On stable parallel linear system solvers.
J. ACM, 25(1):81-91, 1978.
[11] O. Schenk and K. Gärtner. Solving unsymmetric sparse systems of
linear equations with PARDISO. Journal of Future Generation Computer
Systems, 20(3):475-487, 2004.
Appendix A
Mathematical Description of Key Strategies
In the following sections, we outline the algorithms corresponding to the six
(RSS,DFS) combinations supported in Intel® Adaptive Spike-Based Solver 1.0.
Since OIS is always 3 in the current release, and since BiCGStab is a
well-documented method, we do not explain it here. The following descriptions
assume four MPI processes.
(RSS, DFS, 3): Refine the solution of Ax = f using the BiCGStab iterative
solver.
    solve Ax = f via preconditioned BiCGStab (with preconditioner M);
        solve Mz = r using (RSS,DFS);
    end
The exact spike factorization consists of $A = DS$. Each computational
scheme, however, only produces an approximation $\tilde{D}$ of $D$ and
$\tilde{S}$ of $S$. In other words, for solving $Ax = f$ via an iterative
scheme we use $M = \tilde{D}\tilde{S}$ as a preconditioner. Here,
$A = M + R$, where $R$ is a correction term.
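As a hedged aside, the role of M inside the iteration can be written out explicitly. Writing $\tilde{D}$ and $\tilde{S}$ for the approximate factors that a scheme produces (the tilde notation is an assumption of this sketch), each preconditioning step $z = M^{-1} r$ splits into two solves. This is the pattern the schemes in this appendix follow: a block-diagonal solve that modifies the right-hand side, then a solve with the spike matrix.

```latex
\begin{align*}
  M &= \tilde{D}\,\tilde{S}, \qquad A = M + R, \\
  z &= M^{-1} r = \tilde{S}^{-1}\,\tilde{D}^{-1} r
  \quad\Longleftrightarrow\quad
  \tilde{D}\,g = r, \qquad \tilde{S}\,z = g .
\end{align*}
```

When $R = 0$, so that $M = A$, a single application of $M^{-1}$ already solves $Az = r$; otherwise the outer iteration corrects for $R$.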
The preconditioner M is defined as shown in Table A.1 for each (RSS,DFS)
pair.

Table A.1: Preconditioners for different schemes

(RSS,DFS)   Preconditioner
TA          $M_{TA} = D_{TA} S_{TA}$
TU          $M_{TU} = D_{TU} S_{TU}$
FL          $M_{FL} = D_{FL} S_{FL}$
RL          $M_{RL} = D_{RL} S_{RL}$
RP          $M_{RP} = D S$
EA          $M_{EA} = D_{EA} S_{EA}$
Note that $D_{TA} = D_{EA}$ and $D_{FL} = D_{RL}$. The reduced system in FL
is solved iteratively without forming the coefficient matrix explicitly.
In EA, the reduced system is also solved iteratively, but it is formed
explicitly. The details of how the diagonal and spike systems are treated are
given in the following sections. Throughout, we present the solution process
of $Az = r$ in which $z$ is the action $M^{-1} r$.
A.1 Az = r via TU
The matrix, RHS, and solution are distributed among the MPI processes as
shown in Figure A.1.
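\[
A = \begin{pmatrix}
A_1 & B_1 &     &     \\
C_2 & A_2 & B_2 &     \\
    & C_3 & A_3 & B_3 \\
    &     & C_4 & A_4
\end{pmatrix},
\qquad
z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix},
\qquad
r = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{pmatrix}
\]
Block row j of A, z, and r is held by MPI process (j), j = 1, 2, 3, 4.
Figure A.1: Illustration of the partitioning of the linear system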
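The TU scheme consists of the following steps:
1. Compute the LU and UL factorizations without pivoting (apply diagonal
   boosting if needed):
   \[
   L_j U_j \leftarrow A_j \quad \text{for } j = 1, 2, 3, \qquad
   U_j L_j \leftarrow A_j \quad \text{for } j = 2, 3, 4.
   \]
2. Compute the tips of the spikes V, W in Figure A.2 as follows.
   Solve for $V_j^{(b)}$:
   \[
   L_j U_j \begin{pmatrix} \vdots \\ V_j^{(b)} \end{pmatrix}
   = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{pmatrix}
   \quad \text{for } j = 1, 2, 3.
   \]
   Solve for $W_j^{(t)}$:
   \[
   U_j L_j \begin{pmatrix} W_j^{(t)} \\ \vdots \end{pmatrix}
   = \begin{pmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{pmatrix}
   \quad \text{for } j = 2, 3, 4.
   \]
   This process is described in detail in Figure A.3.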
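\[
S = \begin{pmatrix}
I   & V_1 &     &     \\
W_2 & I   & V_2 &     \\
    & W_3 & I   & V_3 \\
    &     & W_4 & I
\end{pmatrix}
\]
Block row j is held by MPI process (j). Each spike $V_j$, $W_j$ is a narrow
column block next to the identity.
Figure A.2: SPIKE matrix
\[
L\,U \begin{pmatrix} \vdots \\ V_j^{(b)} \end{pmatrix}
= \begin{pmatrix} 0 \\ B_j \end{pmatrix}
\]
Figure A.3: The bottom of the $V_j$ spike can be computed using only the
bottom $m \times m$ blocks of L and U. Similarly, the top of the $W_j$ spike
may be obtained if one performs the UL factorization.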
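3. Modify the RHS by solving $L_j U_j g_j = r_j$ (j = 1, 2) and
   $U_j L_j g_j = r_j$ (j = 3, 4).
4. Solve the truncated reduced system (block diagonal) via a direct
   scheme, where each block has the following form:
   \[
   \begin{pmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{pmatrix}
   \begin{pmatrix} z_j^{(b)} \\ z_{j+1}^{(t)} \end{pmatrix}
   = \begin{pmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{pmatrix}
   \quad (j = 1, 2, 3)
   \]
5. Solve
   \[
   A_j z_j = r_j
   - \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)}
   - \begin{pmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{pmatrix} z_{j-1}^{(b)}
   \]
   using the LU or UL factorization of $A_j$ (j = 1, 2, 3, 4; $C_1 = 0$; and
   $B_4 = 0$).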
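The steps above can be sketched in a few lines of Python for the simplest case, a diagonally dominant tridiagonal matrix, where every block $B_j$, $C_j$ and every spike tip is a scalar (m = 1). This is only an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the local solves use a plain Thomas algorithm in place of the separate LU and UL factorizations, and the number of partitions is assumed to divide the matrix order.

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system without pivoting.

    lower[i] = A[i][i-1] and upper[i] = A[i][i+1]; lower[0] and upper[-1]
    do not influence the result.
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def truncated_spike(lower, diag, upper, rhs, p=4):
    """Approximate solve of a tridiagonal system using p equal partitions."""
    size = len(diag) // p  # assumption: p divides the matrix order
    parts = [(j * size, (j + 1) * size) for j in range(p)]

    def local_solve(j, b):
        s, e = parts[j]
        return thomas(lower[s:e], diag[s:e], upper[s:e], b)

    def unit(index, value):
        b = [0.0] * size
        b[index] = value
        return b

    # Steps 1 to 3: modified right-hand sides g_j and the spike tips.
    g = [local_solve(j, rhs[s:e]) for j, (s, e) in enumerate(parts)]
    v_bot = [local_solve(j, unit(-1, upper[e - 1]))[-1] if j < p - 1 else 0.0
             for j, (s, e) in enumerate(parts)]
    w_top = [local_solve(j, unit(0, lower[s]))[0] if j > 0 else 0.0
             for j, (s, e) in enumerate(parts)]

    # Step 4: truncated reduced system, one independent 2x2 solve per interface.
    z_bot = [0.0] * p
    z_top = [0.0] * p
    for j in range(p - 1):
        v, w = v_bot[j], w_top[j + 1]
        gb, gt = g[j][-1], g[j + 1][0]
        det = 1.0 - v * w
        z_bot[j] = (gb - v * gt) / det
        z_top[j + 1] = (gt - w * gb) / det

    # Step 5: local solves with the interface values moved to the RHS.
    z = []
    for j, (s, e) in enumerate(parts):
        b = list(rhs[s:e])
        if j < p - 1:
            b[-1] -= upper[e - 1] * z_top[j + 1]
        if j > 0:
            b[0] -= lower[s] * z_bot[j - 1]
        z.extend(local_solve(j, b))
    return z


def tridiag_matvec(lower, diag, upper, x):
    """Multiply the tridiagonal matrix by x (used only to build a test RHS)."""
    n = len(diag)
    return [(lower[i] * x[i - 1] if i > 0 else 0.0) + diag[i] * x[i]
            + (upper[i] * x[i + 1] if i < n - 1 else 0.0) for i in range(n)]


n = 32
lower, diag, upper = [-1.0] * n, [4.0] * n, [-1.0] * n
x_true = [float(i % 5) - 2.0 for i in range(n)]
rhs = tridiag_matvec(lower, diag, upper, x_true)
z = truncated_spike(lower, diag, upper, rhs, p=4)
max_err = max(abs(a - b) for a, b in zip(z, x_true))
print("max error of the truncated solve:", max_err)
```

The result is close to the true solution but not exact, because the truncated system ignores the far tips of the spikes. For a strongly diagonally dominant matrix those tips are tiny, which is why the outer iteration typically needs few steps to remove the remaining error.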
A.2 Az = r via FL
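The FL scheme consists of the following steps:
1. Compute the LU factorization without pivoting (apply diagonal boosting,
   if needed):
   \[
   L_j U_j \leftarrow A_j \quad \text{for } j = 1, 2, 3, 4.
   \]
2. Modify the RHS by solving $L_j U_j g_j = r_j$ (j = 1, 2, 3, 4).
3. Solve the reduced system iteratively:
   \[
   \begin{pmatrix}
   I_m & V_1^{(b)} & & & & \\
   W_2^{(t)} & I_m & & V_2^{(t)} & & \\
   W_2^{(b)} & & I_m & V_2^{(b)} & & \\
   & & W_3^{(t)} & I_m & & V_3^{(t)} \\
   & & W_3^{(b)} & & I_m & V_3^{(b)} \\
   & & & & W_4^{(t)} & I_m
   \end{pmatrix}
   \begin{pmatrix}
   z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\ z_3^{(t)} \\ z_3^{(b)} \\ z_4^{(t)}
   \end{pmatrix}
   =
   \begin{pmatrix}
   g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\ g_3^{(t)} \\ g_3^{(b)} \\ g_4^{(t)}
   \end{pmatrix}
   \]
   where the actions of the multiplications with $W_j^{(t)}$, $W_j^{(b)}$,
   $V_j^{(t)}$, and $V_j^{(b)}$ are realized via
   \[
   \begin{pmatrix} I_m & 0 \end{pmatrix} A_j^{-1}
   \begin{pmatrix} C_j \\ 0 \end{pmatrix}, \quad
   \begin{pmatrix} 0 & I_m \end{pmatrix} A_j^{-1}
   \begin{pmatrix} C_j \\ 0 \end{pmatrix}, \quad
   \begin{pmatrix} I_m & 0 \end{pmatrix} A_j^{-1}
   \begin{pmatrix} 0 \\ B_j \end{pmatrix}, \quad
   \begin{pmatrix} 0 & I_m \end{pmatrix} A_j^{-1}
   \begin{pmatrix} 0 \\ B_j \end{pmatrix},
   \]
   respectively. This requires solving systems involving $A_j$ using the
   previously computed LU factorizations.
4. Solve
   \[
   A_j z_j = r_j
   - \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)}
   - \begin{pmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{pmatrix} z_{j-1}^{(b)}
   \]
   using the LU factorization of $A_j$ (j = 1, 2, 3, 4; $C_1 = 0$; and
   $B_4 = 0$).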
A.3 Az = r via RL/RP

The RL and RP schemes consist of the following steps:
1. Compute the LU factorization with pivoting (RP) or without pivoting (RL).
   If no pivoting is used, apply diagonal boosting, if needed:
   \[
   L_j U_j \leftarrow P_j A_j \quad \text{for } j = 1, 2, 3, 4
   \quad (P_j = I \text{ for RL}).
   \]
2. Solve for $V_j$:
   \[
   L_j U_j V_j = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{pmatrix}
   \quad \text{for } j = 1, 2, 3.
   \]
3. Solve for $W_j$:
   \[
   L_j U_j W_j = \begin{pmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{pmatrix}
   \quad \text{for } j = 2, 3, 4.
   \]
4. Modify the RHS by solving $L_j U_j g_j = r_j$ (j = 1, 2, 3, 4).
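5. Form the reduced system
   \[
   \begin{pmatrix}
   I_m & & V_1^{(t)} & & & & & \\
   & I_m & V_1^{(b)} & & & & & \\
   & W_2^{(t)} & I_m & & V_2^{(t)} & & & \\
   & W_2^{(b)} & & I_m & V_2^{(b)} & & & \\
   & & & W_3^{(t)} & I_m & & V_3^{(t)} & \\
   & & & W_3^{(b)} & & I_m & V_3^{(b)} & \\
   & & & & & W_4^{(t)} & I_m & \\
   & & & & & W_4^{(b)} & & I_m
   \end{pmatrix}
   \begin{pmatrix}
   z_1^{(t)} \\ z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\
   z_3^{(t)} \\ z_3^{(b)} \\ z_4^{(t)} \\ z_4^{(b)}
   \end{pmatrix}
   =
   \begin{pmatrix}
   g_1^{(t)} \\ g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\
   g_3^{(t)} \\ g_3^{(b)} \\ g_4^{(t)} \\ g_4^{(b)}
   \end{pmatrix}
   \]
   and partition it into two halves as follows:
   \[
   \begin{pmatrix} A_1 & B_1 \\ C_2 & A_2 \end{pmatrix}
   \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}
   = \begin{pmatrix} g_1 \\ g_2 \end{pmatrix}
   \]
   In steps 6 to 9, the symbols $A_1$, $B_1$, $C_2$, $A_2$, $z_1$, $z_2$,
   $g_1$, and $g_2$ denote the blocks of this partitioned reduced system, not
   the blocks of the original system.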
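6. Solve for $V_1$ and $W_2$ in
   \[
   A_1 V_1 = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_1 \end{pmatrix},
   \qquad
   A_2 W_2 = \begin{pmatrix} C_2 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
   \]
7. Modify the RHS: $A_1^{-1} g_1 = h_1$ and $A_2^{-1} g_2 = h_2$.
8. Solve the reduced system via a direct scheme:
   \[
   \begin{pmatrix} I_m & V_1^{(b)} \\ W_2^{(t)} & I_m \end{pmatrix}
   \begin{pmatrix} z_1^{(b)} \\ z_2^{(t)} \end{pmatrix}
   = \begin{pmatrix} h_1^{(b)} \\ h_2^{(t)} \end{pmatrix}
   \]
9. Retrieve $z_1$ and $z_2$:
   \[
   z_1 = h_1 - V_1 z_2^{(t)}, \qquad z_2 = h_2 - W_2 z_1^{(b)}.
   \]
10. Retrieve $z_j$ (j = 1, 2, 3, 4) for the original system, using the
    $g_j$ from step 4:
   \[
   z_j = g_j - V_j z_{j+1}^{(t)} - W_j z_{j-1}^{(b)}
   \quad (V_4 = 0 \text{ and } W_1 = 0).
   \]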
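The two-partition solve in steps 6 to 9 can be sketched for a scalar tridiagonal system as follows. This is a minimal illustration, not the library's code: the function names are hypothetical, the blocks $B_1$ and $C_2$ are scalars (m = 1), and a Thomas solver stands in for the LU factorizations. Unlike a truncated scheme, this variant keeps the full spikes, so it reproduces the solution up to rounding.

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system without pivoting.

    lower[i] = A[i][i-1] and upper[i] = A[i][i+1]; lower[0] and upper[-1]
    do not influence the result.
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def spike_two_partitions(lower, diag, upper, rhs):
    """Exact SPIKE solve with two partitions split at row k = n // 2."""
    n = len(diag)
    k = n // 2

    def solve1(b):
        return thomas(lower[:k], diag[:k], upper[:k], b)

    def solve2(b):
        return thomas(lower[k:], diag[k:], upper[k:], b)

    # Step 6: full spikes V1 = A1^{-1} (0, ..., 0, B1) and W2 = A2^{-1} (C2, 0, ..., 0).
    v1 = solve1([0.0] * (k - 1) + [upper[k - 1]])
    w2 = solve2([lower[k]] + [0.0] * (n - k - 1))

    # Step 7: modified right-hand sides h1 and h2.
    h1 = solve1(rhs[:k])
    h2 = solve2(rhs[k:])

    # Step 8: 2x2 reduced system for the interface unknowns.
    det = 1.0 - v1[-1] * w2[0]
    z1_bottom = (h1[-1] - v1[-1] * h2[0]) / det
    z2_top = (h2[0] - w2[0] * h1[-1]) / det

    # Step 9: retrieve both halves from the spikes.
    z1 = [h - v * z2_top for h, v in zip(h1, v1)]
    z2 = [h - w * z1_bottom for h, w in zip(h2, w2)]
    return z1 + z2


n = 12
lower, diag, upper = [-1.0] * n, [5.0] * n, [-2.0] * n
x_true = [float(i % 4) + 1.0 for i in range(n)]
rhs = [(lower[i] * x_true[i - 1] if i > 0 else 0.0) + diag[i] * x_true[i]
       + (upper[i] * x_true[i + 1] if i < n - 1 else 0.0) for i in range(n)]
z = spike_two_partitions(lower, diag, upper, rhs)
max_err = max(abs(a - b) for a, b in zip(z, x_true))
print("max error of the two-partition solve:", max_err)
```

Only two scalars cross the partition boundary in step 8, which is what makes the recursive schemes attractive: the same two-partition solve can be applied again to each half of a larger reduced system.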
A.4 Az = r via TA
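\[
A = \begin{pmatrix}
A_1 & B_1 &     \\
C_2 & A_2 & B_2 \\
    & C_3 & A_3
\end{pmatrix},
\qquad
z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix},
\qquad
r = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}
\]
Block row 1 is held by process (1), block row 2 by processes (2, 4), and
block row 3 by process (3).
Figure A.4: Illustration of the partitioning of the linear system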
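The TA scheme consists of the following steps:
1. Compute the LU and UL factorizations without pivoting (apply diagonal
   boosting, if needed):
   \[
   L_j U_j \leftarrow A_j \quad \text{for } j = 1, 2 \ \text{(processes 1, 2)},
   \qquad
   U_j L_j \leftarrow A_j \quad \text{for } j = 2, 3 \ \text{(processes 4, 3)}.
   \]
2. Solve for $V_j^{(b)}$:
   \[
   L_j U_j \begin{pmatrix} \vdots \\ V_j^{(b)} \end{pmatrix}
   = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{pmatrix}
   \quad \text{for } j = 1, 2.
   \]
3. Solve for $W_j^{(t)}$:
   \[
   U_j L_j \begin{pmatrix} W_j^{(t)} \\ \vdots \end{pmatrix}
   = \begin{pmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{pmatrix}
   \quad \text{for } j = 2, 3.
   \]
   This process is described in detail in Figure A.3.
4. Modify the RHS by solving
   $L_j U_j g_j = r_j$ (j = 1, 2) and $U_j L_j g_j = r_j$ (j = 3).
5. Solve the truncated reduced system (block diagonal) via a direct scheme,
   where each block has the following form:
   \[
   \begin{pmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{pmatrix}
   \begin{pmatrix} z_j^{(b)} \\ z_{j+1}^{(t)} \end{pmatrix}
   = \begin{pmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{pmatrix}
   \quad (j = 1, 2)
   \]
6. Solve
   \[
   A_j z_j = r_j
   - \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)}
   - \begin{pmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{pmatrix} z_{j-1}^{(b)}
   \]
   using the LU or UL factorization of $A_j$ (j = 1, 2, 3; $C_1 = 0$; and
   $B_3 = 0$).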
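Steps 2 and 3 above need only the tips of the spikes, and Figure A.3 notes that the bottom tip depends only on the bottom blocks of L and U. The following sketch checks that observation for a tridiagonal partition, where the bottom block of L is 1 and the bottom block of U is its last diagonal entry. The function names and the numeric values are hypothetical, chosen only for illustration.

```python
def lu_tridiagonal(lower, diag, upper):
    """LU factorization without pivoting of a tridiagonal matrix.

    Returns (l, u): l[i] is the sub-diagonal multiplier of the unit lower
    factor and u[i] is the diagonal of the upper factor. The super-diagonal
    of the upper factor equals the original super-diagonal.
    """
    n = len(diag)
    l = [0.0] * n
    u = [0.0] * n
    u[0] = diag[0]
    for i in range(1, n):
        l[i] = lower[i] / u[i - 1]
        u[i] = diag[i] - l[i] * upper[i - 1]
    return l, u


def solve_with_lu(l, u, upper, rhs):
    """Forward and backward substitution with the factors from lu_tridiagonal."""
    n = len(u)
    y = list(rhs)
    for i in range(1, n):
        y[i] -= l[i] * y[i - 1]
    x = [0.0] * n
    x[-1] = y[-1] / u[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - upper[i] * x[i + 1]) / u[i]
    return x


n = 6
lower = [0.0] + [-1.0] * (n - 1)
diag = [4.0] * n
upper = [-2.0] * (n - 1) + [0.0]
b_j = -2.0  # hypothetical scalar coupling block B_j (m = 1)

l, u = lu_tridiagonal(lower, diag, upper)
full_spike = solve_with_lu(l, u, upper, [0.0] * (n - 1) + [b_j])
tip_only = b_j / u[-1]  # uses only the bottom entries of L (which is 1) and U
print("bottom tip from the full solve:", full_spike[-1])
print("bottom tip from the last pivot:", tip_only)
```

Because the right-hand side is zero except in its last block, forward substitution leaves it unchanged, and the last step of back substitution yields the tip directly. The UL factorization plays the mirror-image role for the top tip of $W_j$.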
A.5 Az = r via EA
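The EA scheme consists of the following steps:
1. Compute the LU and UL factorizations without pivoting (apply diagonal
   boosting if needed):
   \[
   L_j U_j \leftarrow A_j \quad \text{for } j = 1, 2 \ \text{(processes 1, 2)},
   \qquad
   U_j L_j \leftarrow A_j \quad \text{for } j = 2, 3 \ \text{(processes 4, 3)}.
   \]
2. Solve for $V_j$:
   \[
   L_j U_j V_j = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{pmatrix}
   \quad \text{for } j = 1, 2.
   \]
3. Solve for $W_j$:
   \[
   U_j L_j W_j = \begin{pmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{pmatrix}
   \quad \text{for } j = 2, 3.
   \]
4. Modify the RHS by solving
   $L_j U_j g_j = r_j$ (j = 1, 2) and $U_j L_j g_j = r_j$ (j = 3).
5. Solve the reduced system via preconditioned BiCGStab
   \[
   \begin{pmatrix}
   I_m & V_1^{(b)} & & \\
   W_2^{(t)} & I_m & & V_2^{(t)} \\
   W_2^{(b)} & & I_m & V_2^{(b)} \\
   & & W_3^{(t)} & I_m
   \end{pmatrix}
   \begin{pmatrix} z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\ z_3^{(t)} \end{pmatrix}
   =
   \begin{pmatrix} g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\ g_3^{(t)} \end{pmatrix}
   \]
   with a truncated preconditioner
   \[
   M_r = \begin{pmatrix}
   I_m & V_1^{(b)} & & \\
   W_2^{(t)} & I_m & & \\
   & & I_m & V_2^{(b)} \\
   & & W_3^{(t)} & I_m
   \end{pmatrix}
   \]
6. Solve
   \[
   A_j z_j = r_j
   - \begin{pmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)}
   - \begin{pmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{pmatrix} z_{j-1}^{(b)}
   \]
   using the LU or UL factorization of $A_j$ (j = 1, 2, 3; $C_1 = 0$; and
   $B_3 = 0$).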
Appendix B
How Spike Adapt Works
B.1 Why is Spike Adapt Necessary?

Spike Core is a poly-algorithm implementing many different strategies. The
RSS, DFS, and OIS parameters can take many different values, leading to
numerous possibilities. Selecting an optimal strategy requires detailed
knowledge of Spike Core. For example, what strategies are best when the matrix
is not diagonally dominant? How does the matrix bandwidth affect the
choice of strategy? Spike Adapt relieves users of questions like these.
It is designed to select an optimal strategy based on the following matrix
characteristics: matrix size, bandwidth, sparsity, and diagonal dominance.
It also takes the number of MPI processes and the type of partitioning into
account when making a decision (Table 2.4).
B.2 How Does Spike Adapt Work?

Spike Adapt automatically sets the RSS, DFS, and OIS parameters when the
autoadapt element of the spike_param structure is set to true. It currently
supports six Spike Core strategies (RSS,DFS): TU, RL, RP, FL, TA, and EA.
Note that OIS is essentially independent of (RSS,DFS). Moreover, for Intel®
Adaptive Spike-Based Solver 1.0, OIS is always set to 3 (BiCGStab), and FL
is always chosen when the input matrix is in CSR format.
Spike Adapt uses a three-step selection process. It first checks the type
of matrix partitioning and the number of MPI processes to determine which
strategies are allowed (Table 2.4). Next, it performs a grid lookup based on
the matrix size, bandwidth, and diagonal dominance (Figure B.1). The
optimal Spike Core strategy for some matrices is best determined by a
grid lookup. However, if the grid does not enclose the current matrix,
Spike Adapt evaluates performance models for the relevant Spike Core strate-
gies and decides which is best. If neither the grid lookup nor the perfor-
mance models can make a selection, a Spike Core strategy will be chosen
arbitrarily. However, this should be rare and usually indicates a problem in
Spike Adapt.
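The selection order described above can be summarized in a short sketch. Everything in it is hypothetical: the function name, the idea of passing the grid result as a value that is None when the grid does not enclose the matrix, and the dictionary of estimated costs standing in for the performance models. The returned code follows the convention of the return code tables (1 for grid lookup, 2 for performance models, 3 for an arbitrary choice). Ignoring a grid result that is not among the allowed strategies is an assumption of this sketch.

```python
def select_strategy(allowed, grid_choice, model_costs):
    """Return (strategy, code) following the three-step selection order.

    allowed      -- set of strategy names permitted for this partitioning
    grid_choice  -- strategy suggested by the grid, or None if the grid
                    does not enclose the matrix
    model_costs  -- dict mapping strategy name to an estimated cost
    """
    if not allowed:
        raise ValueError("no strategy is allowed for this configuration")

    # Grid lookup takes precedence when it yields an allowed strategy.
    if grid_choice in allowed:
        return grid_choice, 1

    # Otherwise compare the performance-model estimates of allowed strategies.
    costs = {name: model_costs[name] for name in allowed if name in model_costs}
    if costs:
        return min(costs, key=costs.get), 2

    # Last resort: an arbitrary (here: alphabetically first) allowed strategy.
    return sorted(allowed)[0], 3


print(select_strategy({"TU", "RL", "RP", "FL"}, "TU", {}))
print(select_strategy({"TU", "RL"}, None, {"TU": 2.0, "RL": 1.5}))
print(select_strategy({"TA", "EA"}, None, {}))
```

Returning the code alongside the strategy mirrors how the guide lets a user find out afterwards which mechanism made the choice.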
Figure B.1: This schematic illustrates how Spike Adapt might select an
optimal Spike Core strategy using grid lookup. The horizontal and vertical
axes represent two of the relevant matrix characteristics (e.g., matrix size
and bandwidth). If the grid encloses this matrix, an optimal Spike Core
strategy, represented by the different colors, is selected based on proximity.
B.3 Spike Adapt Return Codes

In the event of an error, Spike Adapt sets info=-1 and returns to Spike Core.
The actual error code is stored in the ierr_spike_adapt parameter of the
spike_param structure. Spike Adapt error codes range from -499 to -400. The
meaning of each error code is shown below.
Spike Adapt sets info=0 if it is able to select a Spike Core strategy. In
general, knowing how Spike Adapt selects a particular Spike Core strategy
is unimportant. However, this knowledge can be useful if the user suspects
that Spike Adapt is choosing a suboptimal strategy. In that case, the
ierr_spike_adapt parameter of the spike_param structure also tells how the
Spike Core strategy was selected:
ierr_spike_adapt   Description
1                  Grid lookup used to select Spike Core strategy
2                  Performance models used to select Spike Core strategy
3                  The Spike Core strategy was selected arbitrarily
-402               Spike Adapt could not allocate memory
-403               SPIKE_ADAPT_DATA environment variable not set
-404               Directory containing Spike Adapt performance models not found
-405               Spike Adapt model files not found
-406               Could not open Spike Adapt model files
-407               Error reading Spike Adapt model files

Table B.1: Spike Adapt Return Codes
As mentioned above, arbitrary selection usually indicates a Spike Adapt
problem that should be reported to technical support.
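A caller that wants to act on these codes could wrap the table in a small lookup. The sketch below is illustrative only and is written in Python rather than against the Fortran interface; the function name is hypothetical, and the two arguments simply mirror the info value and the ierr_spike_adapt field described in this section.

```python
ADAPT_CODES = {
    1: "Grid lookup used to select Spike Core strategy",
    2: "Performance models used to select Spike Core strategy",
    3: "The Spike Core strategy was selected arbitrarily",
    -402: "Spike Adapt could not allocate memory",
    -403: "SPIKE_ADAPT_DATA environment variable not set",
    -404: "Directory containing Spike Adapt performance models not found",
    -405: "Spike Adapt model files not found",
    -406: "Could not open Spike Adapt model files",
    -407: "Error reading Spike Adapt model files",
}


def describe_adapt_result(info, ierr_spike_adapt):
    """Return (ok, message) for one Spike Adapt outcome.

    ok is True when a strategy was selected (info == 0) and False when
    Spike Adapt reported an error (info == -1 with a code in -499..-400).
    """
    message = ADAPT_CODES.get(ierr_spike_adapt, "code not listed in Table B.1")
    if info == 0 and ierr_spike_adapt in (1, 2, 3):
        return True, message
    if info == -1 and -499 <= ierr_spike_adapt <= -400:
        return False, message
    return False, "unexpected combination of info and ierr_spike_adapt"


for info, code in [(0, 1), (0, 3), (-1, -403)]:
    ok, message = describe_adapt_result(info, code)
    print(info, code, ok, message)
```

A successful result with code 3 is still worth logging, since an arbitrary selection is the case this section asks users to report.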
Appendix C
MPI Compatibility Library
SpikePACK uses the Message Passing Interface (MPI) for parallel computation.
Though MPI is a standard API, different implementations are generally not
compatible because of header inconsistencies. Also, MPI libraries built with
different Fortran compilers are usually not compatible. Therefore, SpikePACK
does not call the MPI library directly in order to avoid becoming dependent
on a particular MPI implementation. Instead, it calls wrapper functions
contained in a separate library. Pre-built versions of this library are
provided for four common MPI implementations:

libspike_mpi_comm_intelmpi.a - Intel® MPI Library for Linux. This is a
    commercially supported MPI 2.0 implementation from Intel Corporation.

libspike_mpi_comm_mpich1.a - MPICH, an open-source MPI 1.1 implementation
    from Argonne National Lab.

libspike_mpi_comm_mpich2.a - MPICH2, an open-source MPI 2.0 implementation
    from Argonne National Lab.

libspike_mpi_comm_openmpi.a - Open MPI, an open-source MPI 2.0
    implementation with many contributors, including Los Alamos National
    Lab, Indiana University, and University of Tennessee.

These libraries were built with the Intel compilers and are in <spikepack
directory>/lib/<arch>, where <arch> is either 64 for the IA-64 architecture
or em64t for the Intel® 64 architecture. SpikePACK also includes a default
library, libspike_mpi_comm.a, which is identical to
libspike_mpi_comm_intelmpi.a. It is used by the example build scripts. Users
can build their own default library if they prefer a different compiler or
MPI implementation. To do this, build a new libspike_mpi_comm.a using the
source code for the MPI wrappers shipped with SpikePACK, as follows:
1. Copy <spikepack dir>/include/spike_mpi_comm.f90 to a working directory
   <work dir>.
2. Go to the working directory <work dir>.
3. Compile the wrappers using the desired MPI compiler driver for Fortran.
   To use the Intel® MPI Library and the Intel® Fortran Compiler for Linux,
   for example, compile the MPI wrappers as follows:
       mpiifort -O3 -c spike_mpi_comm.f90
4. Create the library as follows:
       ar rcv <spikepack dir>/lib/<arch>/libspike_mpi_comm.a \
           spike_mpi_comm.o
   where <arch> is either 64 for the IA-64 architecture or em64t for the
   Intel® 64 architecture.

A libspike_mpi_comm.a library will be created in <spikepack dir>/lib/<arch>.
This library is specific to the MPI implementation and Fortran compiler that
were used to build it.