Nodal Reordering Strategies To Improve Preconditioning For Finite Element Systems
Peter S. Hou
Master of Science
in
Mathematics
APPROVED:
Peter S. Hou
Mathematics
ABSTRACT
The availability of high performance computing clusters has allowed scientists and
engineers to study more challenging problems. However, new algorithms need to be developed
to take advantage of the new computer architecture (in particular, distributed memory clusters).
Since the solution of linear systems still demands most of the computational effort in many
problems (such as the approximation of partial differential equation models), iterative methods
and, in particular, efficient preconditioners need to be developed.
ACKNOWLEDGEMENTS
I wish to express my deepest appreciation for my advisor, Dr. Jeff Borggaard. You
provided me with a research opportunity, guidance, and financial support. Without you, I
would not even have known what scientific computing is, or how fascinating it can be. When I
encountered difficulties during my research, your encouraging words always refueled my energy
and kept me motivated.
I would also like to thank Dr. Traian Iliescu and Dr. Serkan Gugercin for being on my
thesis committee. I understand that this is an extra task that you willingly took on amid
your busy schedules. Your expertise has led me further into exploring this field of
mathematics.
In addition, I must acknowledge the two people who shaped my life as a mathematician,
although they taught me nothing about advanced mathematics. Thank you for talking me into
becoming a math major, Dr. Lee Johnson. Your faith in me gave me the strength to come this far.
Oh, and I will never forget the torturous math training you imposed on me, Ms. Jing-Lan Fu.
I used to hate it so much, but how could I possibly have become good with numbers
otherwise?
Table of Contents
Acknowledgements
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Literature Overview
2.1 Finite Element Methods
2.2 Linear System Solvers
2.2.1 LU Decomposition
2.2.2 Iterative Solvers
2.2.2.1 Jacobi Method
2.2.2.2 Gauss-Seidel Method
2.2.2.3 Successive Over Relaxation (SOR)
2.2.2.4 Krylov Subspace Methods: Generalized Minimum Residual (GMRES)
2.3 Preconditioners
2.3.1 Jacobi Preconditioner
2.3.2 Incomplete LU (ILU) Factorization
2.3.2.1 Structure-Based ILU(ℓ)
2.3.2.2 Threshold-Based ILUT
2.4 Nodal Reordering Strategies for Finite Element Meshes
2.4.1 Cuthill-McKee Algorithm
2.4.2 Reverse Cuthill-McKee Algorithm (RCM)
Chapter 3 Problem Description
Chapter 4 Numerical Experiments
4.1 Finite Element Meshes
4.2 ILU(0) and ILUT
4.2.1 ILU(0)
4.2.2 ILUT
4.2.3 Comparisons
4.3 CM and RCM
4.3.1 The Structure
4.3.2 The Experiments
4.4 Breadth-First Search Orderings
Chapter 5 The Parallel Case
5.1 Onto a Parallel Computer
5.2 The Reordering Scheme
5.3 ILU Analysis
5.4 A Partitioning Test
5.5 Other Partitioning Considerations
Chapter 6 Conclusions
References
Vita
List of Figures
List of Tables
Chapter 1 Introduction
Many computational problems in science or engineering require the solution of large sparse
linear systems [6, 10, 14]. These systems have the form of finding an n-dimensional vector x
such that
Ax = b
where A is an n-by-n matrix and b is the n-dimensional right hand side vector. Due to the
challenge in solving these problems and their importance in real-world modeling and analysis, a
wide class of numerical algorithms has been developed to solve them. These algorithms are
very specialized, taking advantage of problem structure and computer architecture. A popular
class of algorithms is based on Krylov subspaces [18, 28]. These are iterative methods that,
under certain conditions (good problem conditioning, a good initial guess, appropriate parameter
tuning, etc.), are much more efficient than direct methods. They also lend themselves naturally
to parallel implementations. Thus, Krylov subspace methods are a popular choice in high
performance computing applications.
One of the limitations of the iterative methods is the condition number of the matrix A,

K(A) = ||A|| ||A^{-1}||,

where ||·|| represents one of the matrix norms (usually the 2-norm). There is a correlation
between the number of iterations the algorithm requires to converge (hence the computational cost)
and the magnitude of the condition number. The closer K(A) is to 1, the better. The notion
of left preconditioning is to premultiply the linear system above by a matrix P that is a good
approximation to A−1 ,
PAx = Pb ,
such that K ( PA) is closer to 1. The selection of a good preconditioner is critical to
developing a high performance algorithm. This is typically problem dependent, though a
number of popular strategies have emerged. Few of these, however, parallelize well.
Chapter 2 Literature Overview
This chapter introduces and examines some well-known techniques involved in solving the
linear systems of our interest on a computer. Section 2.1 briefly describes finite element
methods as the origin of our problem. Section 2.2 examines classic linear system solvers, and
leads into more efficient iterative solvers. Section 2.3 discusses preconditioners which
preprocess the linear systems to help iterative solvers converge faster. Lastly, Section 2.4
introduces finite element nodal reordering strategies that can potentially make preconditioners
more effective.
2.1 Finite Element Methods

The first step in a finite element method is to discretize the domain Ω of interest into a
finite element mesh, which is an undirected graph with nodes spaced across the domain. The
density of the nodes, or mesh points, may vary depending on the complexity of the subdomains.
Each mesh point is associated with an unknown and a basis function ϕ, which has value 1 at
that mesh point and value 0 at every other mesh point.
The weak form of the governing equation then leads to the system of equations

Σ_{j=1}^{n} [ ( ∫∫_Ω ∇ϕ_i(x, y) · ∇ϕ_j(x, y) dA ) u_j ] = ∫∫_Ω f(x, y) ϕ_i(x, y) dA.

Defining A_ij = ∫∫_Ω ∇ϕ_i(x, y) · ∇ϕ_j(x, y) dA, x_i = u_i, and b_i = ∫∫_Ω f(x, y) ϕ_i(x, y) dA, this can be
represented as a linear system Ax = b, where x is a column vector of unknown values at the
mesh points, while A and b correspond to the left- and right-hand sides of the equation.
From here, the complex physical problem has been reduced down to a standard system of
equations.
The finite element methods use basis functions ϕi with local support. As a means to
construct these, the problem domain Ω is partitioned into regular subdomains: e.g., intervals in
1-D, triangles or rectangles in 2-D, and tetrahedrons or bricks in 3-D. Nodes are placed on
vertices and perhaps edges, faces or interiors on which piecewise polynomial bases are generated.
While there is a natural assignment of unknown numbers in 1-D elements, there are many nodal
ordering choices in higher dimensions. As we discuss below, this ordering has a dramatic
impact on fill-in for direct linear system solvers, and this carries over to preconditioners based on
direct solvers. This is the main topic of this research.
2.2 Linear System Solvers

The most trivial and primitive method is to find the inverse of A, assuming it exists.
Direct methods such as Gaussian Elimination are easy to understand and implement, and they
produce exact inverses up to finite precision arithmetic. Subsequently, x = A−1 Ax = A−1b
gives us the solution of the system.
This method, however, is rarely used beyond an elementary linear algebra class. The reason
is simple: computing the inverse of a matrix is too expensive. Real-world problems can easily
have millions of equations with millions of unknowns. Computing the inverse of such a system
not only requires tremendous computational power, but can also take up an unrealistic amount of
memory. In addition, the process does not parallelize well, and hence cannot be sped up
efficiently even with multiple processors. Therefore, we introduce some linear system solvers
that can be implemented more effectively.
2.2.1 LU Decomposition
LU decomposition factors A into a unit lower triangular matrix L and an upper triangular
matrix U. For a 4-by-4 matrix:

[ A11 A12 A13 A14 ]   [  1    0    0    0 ] [ U11 U12 U13 U14 ]
[ A21 A22 A23 A24 ] = [ L21   1    0    0 ] [  0  U22 U23 U24 ]
[ A31 A32 A33 A34 ]   [ L31  L32   1    0 ] [  0   0  U33 U34 ]
[ A41 A42 A43 A44 ]   [ L41  L42  L43   1 ] [  0   0   0  U44 ]
One advantage of this factorization is the ability to store L and U in the same matrix to
conserve storage space. Algorithm 1 takes such advantage and factors matrix A in-place.
The output of this algorithm is in the form:
     [ U11 U12 U13 U14 ]
LU = [ L21 U22 U23 U24 ]
     [ L31 L32 U33 U34 ]
     [ L41 L42 L43 U44 ]
which, when necessary, can be easily broken into two separate matrices. Note that because the
values on the main diagonal of L are known, they need not be stored in LU.
For j = 1 to n-1
    For i = j+1 to n
        α = A_{i,j} / A_{j,j}
        For k = j+1 to n
            A_{i,k} = A_{i,k} - α A_{j,k}
        End
        A_{i,j} = α
    End
End
Algorithm 1: An LU decomposition
We see from line 3 in the algorithm that numerical accuracy may be at stake if any entry on
the main diagonal of A becomes very small during the factorization. To improve this
situation, we can apply partial pivoting: permute the rows of A so that the element of
maximum absolute value in each column lies on the main diagonal. This rearrangement of
equations does not affect the solution, as long as the same permutation is also applied to b.
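As a concrete illustration, Algorithm 1 (without pivoting) can be transcribed into Python with NumPy roughly as follows; the function name and the test matrix are ours, for illustration only:

```python
import numpy as np

def lu_in_place(A):
    """Algorithm 1: factor A so that the strict lower triangle holds L
    (unit diagonal implied) and the upper triangle holds U."""
    F = A.astype(float).copy()
    n = F.shape[0]
    for j in range(n - 1):
        for i in range(j + 1, n):
            alpha = F[i, j] / F[j, j]              # the multiplier L_{i,j}
            F[i, j + 1:] -= alpha * F[j, j + 1:]   # eliminate row i beyond column j
            F[i, j] = alpha                        # store L_{i,j} in the zeroed slot
    return F

# Recover L and U from the packed factor and verify LU = A.
A = np.array([[4., 3., 2., 1.],
              [3., 4., 3., 2.],
              [2., 3., 4., 3.],
              [1., 2., 3., 4.]])
F = lu_in_place(A)
L = np.tril(F, -1) + np.eye(4)
U = np.triu(F)
```

A forward and a backward substitution with these factors then solve Ax = b.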
After matrix A has been decomposed into the appropriate factors, we can substitute L
and U to solve the linear system Ax = b :
Ax = ( LU ) x = L(Ux ) = b
Let y = Ux. Note that Σ_{j=1}^{i} L_{ij} y_j = b_i for all 1 ≤ i ≤ n, so we can solve Ly = b recursively
using a forward substitution algorithm.
For i = 1 to n
    y_i = (b_i - Σ_{j=1}^{i-1} L_{ij} y_j) / L_{ii}    (here L_{ii} = 1)
End
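A minimal NumPy sketch of this forward substitution (function name and example system are ours):

```python
import numpy as np

def forward_substitute(L, b):
    """Solve L y = b by forward substitution (L lower triangular)."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        # y_i = (b_i - sum_{j<i} L_ij y_j) / L_ii; L_ii = 1 for unit triangular L
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

# Example: a unit lower triangular system.
L = np.array([[1., 0., 0.],
              [2., 1., 0.],
              [3., 4., 1.]])
b = np.array([1., 4., 14.])
y = forward_substitute(L, b)
```

Solving Ux = y with the analogous backward substitution then completes the solve.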
2.2.2 Iterative Solvers

In the case when computing the exact solution to a linear system is impossible or infeasible,
iterative solvers can be employed to numerically approximate the true solution x . Let x0 be
an initial guess of the solution x . The iterative solver computes a sequence {xk } , with
xk → x as k → ∞ . The residual vector rk = b − Axk is used to determine how close we are
to the true solution. Obviously, for a well-conditioned problem, rk ≈ 0 when xk is a good
approximation to x. An iterative solver stops when the residual becomes smaller than a
specified threshold, or when a certain number of iterations has been reached without
convergence. Good iterative solvers aim to make {x_k} converge quickly and to minimize the
norm of the residual vector.
2.2.2.1 Jacobi Method

Consider the splitting

A = L + D + U.
The matrices L and U are the strictly lower and upper triangular parts of A , and D is
the main diagonal. As we discussed above, inverting a triangular matrix can be performed by
forward or backward substitution. Inverting a diagonal matrix simply involves inverting the
diagonal entries. Hence, the inverses appearing below are computationally tractable.
Ax = ( L + D + U ) x = b .
Then, we move L and U to the right hand side to motivate the Jacobi iteration,
D x_{n+1} = b - (L + U) x_n
x_{n+1} = D^{-1} (b - (L + U) x_n).
The Jacobi method solves each equation in the linear system independently [30]. It solves
one variable xi at a time while assuming all other variables x remain fixed. It is extremely
parallelizable in nature. Unfortunately, while this simple idea is very easy to implement, it is
very unstable. It works well with strictly diagonally dominant matrices, but its convergence
is not guaranteed otherwise. The iteration matrix D^{-1}(L + U) must have all of its eigenvalues
inside the unit disk (the smaller, the better).
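A minimal sketch of the Jacobi iteration in NumPy (the function name, the test system, and the spectral-radius check are ours, for illustration):

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Jacobi iteration: x_{n+1} = D^{-1} (b - (L + U) x_n)."""
    d = np.diag(A)
    R = A - np.diag(d)                  # the off-diagonal part L + U
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_new = (b - R @ x) / d
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Strictly diagonally dominant system: convergence is guaranteed, i.e. the
# spectral radius of D^{-1}(L + U) is below 1.
A = np.array([[4., 1., 1.],
              [1., 5., 2.],
              [1., 2., 6.]])
b = np.array([6., 8., 9.])
rho = max(abs(np.linalg.eigvals(np.diag(1 / np.diag(A)) @ (A - np.diag(np.diag(A))))))
x = jacobi(A, b)
```

Note that each component of x_new can be computed independently, which is what makes the method so parallelizable.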
2.2.2.2 Gauss-Seidel Method

The Gauss-Seidel method instead keeps L together with D on the left-hand side:

(L + D) x_{n+1} = b - U x_n
x_{n+1} = (L + D)^{-1} (b - U x_n).
Note that ( L + D) is a lower triangular matrix, so the iterations can be computed using a
forward substitution (no matrix inversion is necessary). Due to this nature, the computations
are sequential: solving each equation requires the solutions from the previous equations.
Therefore, this algorithm is not parallelizable like the Jacobi method. However, it is relatively
more stable, and it is applicable to strictly diagonally dominant matrices and to symmetric
positive definite matrices.
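A sketch of a Gauss-Seidel sweep in NumPy (names and test system ours); note that each update immediately reuses the newest values, which is exactly the forward substitution with (L + D):

```python
import numpy as np

def gauss_seidel(A, b, tol=1e-10, max_iter=1000):
    """Gauss-Seidel: each x_i update uses the values already computed this sweep."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # x[:i] holds this sweep's new values, x[i+1:] last sweep's values
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

# Same strictly diagonally dominant test system as for Jacobi (ours).
A = np.array([[4., 1., 1.],
              [1., 5., 2.],
              [1., 2., 6.]])
b = np.array([6., 8., 9.])
x = gauss_seidel(A, b)
```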
6
2.2.2.3 Successive Over Relaxation (SOR)
SOR is derived by extrapolating the Gauss-Seidel method, taking a weighted average of the
Gauss-Seidel update and the previous iterate [31]:

(ωL + D) x_{n+1} = ωb - ωU x_n + (1 - ω) D x_n
x_{n+1} = (ωL + D)^{-1} (ωb - ωU x_n + (1 - ω) D x_n),    0 < ω < 2.
When ω is chosen properly, this method speeds up the convergence rate. The difficult task
is to choose a good value of ω for each specific problem. When ω = 1, SOR reduces to
Gauss-Seidel. Also, this method fails to converge if ω falls outside of (0, 2).
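The weighted-average form above can be sketched componentwise in NumPy (names and test system ours): each Gauss-Seidel update is blended with the previous value by the weight ω:

```python
import numpy as np

def sor(A, b, omega, tol=1e-10, max_iter=1000):
    """SOR: blend each Gauss-Seidel update with the previous iterate by omega."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            gs = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
            x[i] = (1 - omega) * x[i] + omega * gs   # weighted average
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

A = np.array([[4., 1., 1.],
              [1., 5., 2.],
              [1., 2., 6.]])
b = np.array([6., 8., 9.])
x = sor(A, b, omega=1.1)   # omega = 1 reproduces Gauss-Seidel exactly
```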
2.2.2.4 Krylov Subspace Methods: Generalized Minimum Residual (GMRES)

The Krylov subspace methods are a family of iterative solvers that, unlike the three above,
do not have an iteration matrix. Their implementations are based on the minimization of some
measure of error over the affine space x_0 + K_k at each iteration k, where x_0 is the initial
iterate and K_k is the k-th Krylov subspace,

K_k = span{ r_0, A r_0, ..., A^{k-1} r_0 },
where r0 = b − Ax0 is the initial residual vector.
Many variants of Krylov subspace methods exist, and they possess various strengths and
limitations. Well-known versions include the Conjugate Gradient Method, the Generalized
Conjugate Residual Method, and the Minimum Residual Method, whose applications are limited
to symmetric positive definite systems, non-symmetric positive definite systems, and symmetric
indefinite systems, respectively [18]. The most popular variant of Krylov subspace methods is
the Generalized Minimum Residual (GMRES) Method [28], due to its applicability to
non-symmetric indefinite systems. We introduce this method here and use it as our iterative
solver in the experiments.
An orthonormal basis {v_i} for this subspace is built one vector at a time by a Gram-Schmidt
process:

w_i = A v_i
For k = 1 to i
    w_i = w_i - <w_i, v_k> v_k
End
v_{i+1} = w_i / ||w_i||
When applied to the Krylov sequence, this method is known as the Arnoldi Algorithm [1].
The inner product coefficients <w_i, v_k> and the norms ||w_i|| are stored in an upper Hessenberg matrix.
Suppose we have generated the complete orthonormal basis V. Then we can represent the
solution as x = x_0 + Σ_{i=1}^{n} v_i y_i, where the v_i are the column vectors of V, and the scalars y_i
are chosen to minimize at each step the norm of the residual vector b - Ax = b - A(x_0 + Σ_{i=1}^{n} v_i y_i).
In other words, this algorithm always converges to the exact solution in at most n iterations,
provided exact arithmetic is used. In practice, however, this fact does not have much value.
When n is large, not only is the number of iterations unaffordable, but the storage required
for V and H also becomes prohibitive.
The restarted version of GMRES overcomes this problem. Given a natural number m ≤ n,
the algorithm stops and "restarts" after m iterations. The intermediate result
x_m = x_0 + Σ_{i=1}^{m} v_i y_i is used as the new x_0, V and H are cleared from memory, and the
whole process repeats from the beginning until convergence is achieved.
Choose x_0
Compute r_0 = b - A x_0;  v_1 = r_0 / ||r_0||;  β_1 = ||r_0||
For j = 1 to m
    Compute w_j = A v_j
    For i = 1 to j
        h_{i,j} = <w_j, v_i>
        w_j = w_j - h_{i,j} v_i
    End
    h_{j+1,j} = ||w_j||
    If h_{j+1,j} = 0 then m = j; exit for
    v_{j+1} = w_j / h_{j+1,j}
End
Define the (m+1) × m Hessenberg matrix H_m = (h_{i,j})
Compute y_m to minimize ||β_1 e_1 - H_m y||
x_m = x_0 + V_m y_m
The difficult task in the restarted version of GMRES is to choose an appropriate m .
When m is too small, the algorithm may converge very slowly or fail to converge. When m
is too large, excessive computations and storage make the process unnecessarily expensive.
Unfortunately, the optimal m depends entirely on each particular system, and there is no definite
rule for choosing this number.
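As an illustration, a minimal dense-matrix GMRES(m) can be written in NumPy as follows (function and variable names are ours; production codes work on sparse matrices and typically apply Givens rotations to H instead of a generic least-squares solve):

```python
import numpy as np

def gmres_restarted(A, b, m=20, tol=1e-10, max_restarts=100):
    """GMRES(m): run m Arnoldi steps, minimize ||beta e_1 - H y|| in the
    small least-squares problem, then restart from the improved iterate."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_restarts):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta < tol:
            break
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        k = m
        for j in range(m):
            w = A @ V[:, j]
            for i in range(j + 1):            # modified Gram-Schmidt
                H[i, j] = w @ V[:, i]
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] < 1e-14:           # "happy breakdown"
                k = j + 1
                break
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(k + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)
        x = x + V[:, :k] @ y
    return x

# Non-symmetric but well-conditioned test system (ours, for illustration).
rng = np.random.default_rng(0)
n = 30
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = gmres_restarted(A, b, m=10)
```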
2.3 Preconditioners
In nearly every practical example, iterative methods for the original linear system converge
too slowly. Analyses of these algorithms have found that there is a correlation between the
conditioning of the system matrix, measured by the condition number,
K(A) = ||A|| ||A^{-1}||,
and the number of iterations (work) required to converge [18]. Thus, the original system must
be preconditioned to improve algorithm performance. However, to be effective, the cost of
solving the preconditioned system (including the cost of preconditioning) should be less than the
cost of solving the original system. In fact, the reduction needs to be dramatic for iterative
methods to be effective.
In this study, we consider left preconditioners. Thus, we “premultiply” both sides of the
equation by a preconditioning matrix P ,
( PA) x = Pb .
The optimal preconditioner would be the inverse of A (although this would never be a practical
preconditioner). The resulting preconditioned matrix would have a condition number of 1 (the
smallest possible).
In general, the larger the condition number is, the harder it is to find a good approximate
inverse for the matrix. The base-b logarithm of K ( A) estimates how many base-b digits are
lost in solving a linear system with matrix A. The convergence of GMRES is bounded by

||r_k|| ≤ C ( (K(A) - 1) / (K(A) + 1) )^k ||r_0||,

where r_k is the k-th residual vector in GMRES [18]. Moreover, the accuracy of any iterative
solution is bounded by

||x - x_k|| / ||x|| ≤ K(A) ||r_k|| / ||b||,

where x is the true solution and x_k is the k-th approximation to x [5]. Therefore,
reducing the condition number of the system is important for both speed and accuracy of an
iterative solver.
Many preconditioners with different strengths and applications have been developed, and
we examine two: Jacobi Preconditioner [18] and the ILU Preconditioner family [13, 27]. They
are both based on modified versions of two linear system solvers. We shall see how impractical
solvers can be transformed into powerful preconditioners.
2.3.1 Jacobi Preconditioner

The Jacobi Preconditioner, also known as the Diagonal Preconditioner, is derived from the
Jacobi iterative method. It applies the inverse of the diagonal entries of A to both sides of the
equation, with the hope of reducing the condition number. If matrix A is diagonally
dominant, the inverse of its diagonal may be a good approximation to the inverse of A itself.
It works well on certain diagonally dominant matrices. Like the Jacobi method, this
preconditioning procedure is highly parallelizable but can be very unstable. Improved versions
such as Block Diagonal Preconditioning are available, but suffer from similar limitations [20].
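A small numerical illustration of this effect (the 4-by-4 test matrix and the variable names are ours): diagonal scaling can sharply reduce the condition number of a diagonally dominant matrix whose rows live on very different scales.

```python
import numpy as np

# Diagonally dominant matrix whose rows have widely varying scales.
A = np.diag([1., 10., 100., 1000.]) + 0.1 * np.ones((4, 4))
P = np.diag(1.0 / np.diag(A))       # Jacobi (diagonal) preconditioner
cond_A = np.linalg.cond(A)          # large, driven by the scale disparity
cond_PA = np.linalg.cond(P @ A)     # close to 1 after diagonal scaling
```

Here P A has unit diagonal and small off-diagonal entries, so it is close to the identity in exactly the sense discussed above.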
2.3.2 Incomplete LU (ILU) Factorization

Define a fill-in to be an initially zero entry in matrix A whose value becomes nonzero as a
result of the basic row operations in the LU factorization. When any Gaussian
Elimination-based algorithm is applied to a sparse matrix, many fill-ins take place and the
resulting product may become very dense. As the number of nonzero entries increases, so does
the memory requirement. When the problem size is large, it can easily become
unbearably expensive to store all the fill-in entries created by LU. Therefore, Incomplete LU
(ILU) Factorization was developed as a practical alternative [22].
A = LU ≈ L̃Ũ,   L̃ ≈ L,   Ũ ≈ U

The Ũ and L̃ produced by ILU are upper and lower triangular matrices "close" to the U
and L factors of A, less some or all of the fill-ins. See Figure 1 to visualize ILU's reduced
memory cost. A major decision in implementing an ILU factorization is to determine which
fill-ins to allow, and which to eliminate. Eliminating more fill-ins keeps the factors sparser,
saving memory space and computing power. On the other hand, allowing more fill-ins keeps
Ũ and L̃ "less different" from U and L, so that Ũ^{-1} L̃^{-1} stays closer to A^{-1}.

(Ũ^{-1} L̃^{-1} A) x = Ũ^{-1} L̃^{-1} b

Instead of being a linear system solver like complete LU, the ILU serves as a preconditioner.
With Ũ^{-1} L̃^{-1} "close" to A^{-1}, the preconditioned matrix Ũ^{-1} L̃^{-1} A would be "close" to the identity matrix. In other
words, it would have a smaller condition number. When an iterative solver is used to solve this
modified system, convergence would be reached faster with higher accuracy. Various ILU
implementations use different theories to eliminate fill-ins, trading off memory requirement for
conditioning quality and vice versa. We shall study the two major families of ILU algorithms:
the structure-based ILU(ℓ) [13] and the threshold-based ILUT [27].
2.3.2.1 Structure-Based ILU(ℓ)

The structure-based ILU(ℓ) implementations allow and deny each fill-in based on its
location relative to the structure of the matrix [13]. The first phase determines the locations of
permissible fill-in entries by assigning each location a level. A fill-in is allowed if its level is
less than or equal to ℓ. In the second phase, an LU factorization takes place, using the
"incomplete" fill-in pattern determined in the first phase to keep certain zero entries intact.
In Algorithm 4, the matrix Λ contains the level values for the entries of A. Each level
is a nonnegative integer, and Λ_ij ≤ ℓ indicates that a fill-in is allowed for A_ij in the
factorization. Entries of Λ are initially set to undefined, and some stay undefined if the
entry is not a possible fill-in (i.e., if the entire column above the entry is zero).
In essence, this algorithm works as follows: if an entry is initially nonzero in A , it has level
0 and no limit is imposed on that entry. If an entry is initially zero, then any possible fill-in at
this location depends on a nonzero entry to its left (the pivot in Gauss Elimination) and a nonzero
entry above it (the row whose multiple adds to this row). Each entry’s level depends on the
levels of the two entries that may be causing its fill-in. Successor entries are considered “less
important” than the predecessor entries and therefore have strictly higher levels. There are two
popular implementations for weighing a level based on its predecessor entries’ levels:
The max rule:

computeWeight(a, b)
{
    return(max{a, b} + 1);
}

The sum rule:

computeWeight(a, b)
{
    return(a + b + 1);
}
We will not compare the strengths of these two rules in this thesis. However, one can see
that the succession of levels grows faster under the sum rule. Therefore, ILU(ℓ)
allows fewer fill-in entries under the sum rule, given the same ℓ.
ILU(0) is a special case of the ILU(ℓ) family. Only entries with level 0, i.e. initially
nonzero entries, are allowed to be nonzero after the factorization. In other words, the factors
have the same sparsity pattern as the original matrix A. ILU(0) is a popular method since it is
intuitively predictable, and its factors require the minimum amount of memory among the
entire ILU(ℓ) family.
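A dense-matrix sketch of ILU(0) in Python (NumPy assumed; names ours, and real implementations work on sparse storage): the elimination of Algorithm 1 is run, but only positions that are nonzero in A are ever written.

```python
import numpy as np

def ilu0(A):
    """ILU(0): LU elimination restricted to the sparsity pattern of A,
    so every would-be fill-in is simply discarded."""
    F = A.astype(float).copy()
    nz = A != 0                      # the original sparsity pattern
    n = A.shape[0]
    for j in range(n - 1):
        for i in range(j + 1, n):
            if not nz[i, j]:
                continue             # this L entry would be a fill-in: skip it
            alpha = F[i, j] / F[j, j]
            F[i, j] = alpha
            for k in range(j + 1, n):
                if nz[i, k]:         # update only positions nonzero in A
                    F[i, k] -= alpha * F[j, k]
    return F

# For a tridiagonal matrix no fill-in occurs, so ILU(0) is the exact LU.
A = np.diag([4., 4., 4., 4.]) + np.diag([1., 1., 1.], 1) + np.diag([1., 1., 1.], -1)
F = ilu0(A)
L0 = np.tril(F, -1) + np.eye(4)
U0 = np.triu(F)
```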
2.3.2.2 Threshold-Based ILUT

Unlike ILU(ℓ), the threshold-based ILUT algorithms maintain the sparsity of a matrix by
controlling the magnitudes of its entries [27]. In the factors, the significance of a nonzero entry
no longer depends on its relative location, but on its absolute value. Imagine that we drop a
certain number of the smallest nonzero entries from L and U, the complete LU factors of A,
to produce L̃ and Ũ. These incomplete factors are sparser, yet still fairly similar to
the complete factors L and U.
In general, ILUT(t) drops any element whose magnitude is smaller than a threshold.
The threshold is often a number t in [0, 1] multiplied by the norm of the active row (or column,
if the algorithm is column-based) in the factorization process. In other words, an entry in row i
of the approximate factorization is replaced by zero if its absolute value is less than t* = t ||A_{i,*}||_2.
For j = 1 to n-1
    For i = j+1 to n
        t* = t ||A_{i,*}||_2
        α = A_{i,j} / A_{j,j}
        For k = j+1 to n
            A_{i,k} = A_{i,k} - α A_{j,k}
        End
        For k = i+1 to n
            If |A_{i,k}| < t* Then A_{i,k} = 0
        End
        A_{i,j} = α
        If |A_{i,j}| < t* Then A_{i,j} = 0
    End
End
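The pseudocode above can be transcribed into Python roughly as follows (NumPy assumed; the dense-matrix form and the names are ours, since a practical ILUT works on sparse row storage):

```python
import numpy as np

def ilut(A, t):
    """ILUT(t): in-place incomplete LU; off-diagonal entries smaller than
    t times the 2-norm of the active row are dropped."""
    F = A.astype(float).copy()
    n = F.shape[0]
    for j in range(n - 1):
        for i in range(j + 1, n):
            tstar = t * np.linalg.norm(F[i, :])   # row-based drop threshold
            alpha = F[i, j] / F[j, j]
            F[i, j + 1:] -= alpha * F[j, j + 1:]
            for k in range(i + 1, n):             # drop small entries; diagonal kept
                if abs(F[i, k]) < tstar:
                    F[i, k] = 0.0
            F[i, j] = alpha if abs(alpha) >= tstar else 0.0
    return F

# With t = 0 nothing is dropped, so ILUT reproduces the complete LU factors.
A = np.array([[4., 1., 0.],
              [1., 4., 1.],
              [0., 1., 4.]])
F = ilut(A, 0.0)
L = np.tril(F, -1) + np.eye(3)
U = np.triu(F)
```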
Unlike in the structure-based ILU(ℓ), a nonzero entry in the original matrix A does not
guarantee a nonzero entry at the same location in the factors. Note that any entry can
be dropped, except those on the main diagonal, which are kept intact so that the factors remain
nonsingular. A smaller threshold value keeps the factors closer to complete, and ILUT(0) is
identical to the exact LU factorization. On the other hand, a very large threshold
eliminates most off-diagonal entries. ILUT(1) produces an identity matrix L̃ and a diagonal
matrix Ũ, which, when used as a preconditioner, is identical to the Jacobi Preconditioner.
The structure-based ILU(ℓ) family and the threshold-based ILUT family are the two major
branches of incomplete LU factorizations. A large number of more sophisticated and robust
ILU algorithms have been developed for different applications [13, 16, 33], and most of them are
based on one of these two basic ideas.
2.4 Nodal Reordering Strategies for Finite Element Meshes

The linear systems in this study are constructed from finite element meshes. Before we
construct the linear system, it is possible for us to assign numbers to the mesh nodes in different
orders, to make the system “easier to process.” The same effect can be achieved by permuting
the rows and columns of the linear system, although permutations of a large system can be
incredibly expensive if they involve the swapping of physical entries.
Depending on how the finite element mesh is created, its nodes may be ordered in a way
that makes computing inefficient. What we refer to as “natural ordering” usually numbers the
nodes by the order in which they enter the system, which may be completely unrelated to the
geometrical structure or connectedness of the mesh. As we shall see, a less intuitive ordering
scheme is often desired for many different reasons.
A typical method for developing parallel algorithms for finite element methods is to
distribute groups of elements to different processors. Therefore, if a node is affiliated with
elements that lie on different processors, then operations with that node require passing data
between processors. The cost of inter-processor communication is often very significant.
Hence, there are two issues: The first is optimally partitioning the elements to minimize the
number of nodes that must be shared (and to achieve good load balancing). The second is the
numbering of nodes on the mesh so that the parallel preconditioner is optimal.
Some reordering strategies can conserve computer storage as well as reduce the actual
calculation time, since they influence the performance of some preconditioners. These
strategies and their potential benefits to finite element methods are of particular interest to us.
We introduce the classic Cuthill-McKee Algorithm here as a starter, and go into further analysis
in a later section.
2.4.1 Cuthill-McKee Algorithm
In order to solve a large system of equations efficiently, one must conserve computer
storage as well as calculation time. E. Cuthill and J. McKee devised a robust algorithm to
condition sparse symmetric matrices by reordering the nodes [7]. The bandwidth of a matrix is
the maximum value of |i - j| over all nonzero entries A_ij; in
other words, it is a measure of how far the nonzero elements lie from the main diagonal.
Matrices with small bandwidths have several advantages, as we shall see later. The
Cuthill-McKee algorithm is designed to reduce the bandwidth of a matrix. Figure 2 shows an
example of its bandwidth reduction ability.
The basic idea is that, for a sparse matrix A, we want to find a permutation matrix P
such that PAP^T, which permutes the rows and columns of A, "moves" the nonzero
elements as close to the main diagonal as possible, hence reducing the bandwidth. In practice,
however, permuting a large matrix would be extremely inefficient, so this algorithm aims to
reorder the nodes on the graph of matrix A prior to the matrix's construction, effectively
introducing a permuted index set.
Before moving on to the algorithm, we shall review some basic terminology in graph theory.
A graph consists of a finite set of nodes (or vertices) connected by a finite set of edges. In a
weighted graph, each edge is assigned a weight, which is a numerical value. The degree of a
node is the sum of weights of the edges connected to it, and in an unweighted graph it is simply
the number of edges connected to it. Two vertices are adjacent if there is an edge between them.
A path is a sequence of consecutive edges, and two nodes are connected if there is a path from
one to the other. A graph is connected if every node in it is connected to every other node. A
component is a connected subgraph. A circuit is a path which ends at the starting node. A tree
is a graph containing no circuit, and a spanning tree of a graph is a subgraph that is a tree and
contains all of the nodes. For more detailed discussion, see [8].
Given a graph G
Select a node of minimum degree, label it 1
When k nodes have been labeled, 1 ≤ k < n ,
Select the smallest i such that node i has unlabeled neighbors
Locate all of node i ’s unlabeled neighbors ( u1 ,..., um )
In increasing degree order, label these nodes k + 1,..., k + m
Repeat until all n nodes in G have been labeled.
In the event that G has more than one component, this algorithm stops after it labels all
m nodes of one component with a tree, m < n . Continue by labeling a node of minimum
degree on another component m + 1 , and repeat until all the nodes in the graph are labeled.
When dealing with finite element meshes that are entirely connected, this algorithm generates a
spanning tree across the entire mesh.
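The labeling procedure above, including the restart on additional components, can be sketched in Python (an illustrative sketch only; the experiments in this thesis use Matlab):

```python
from collections import deque

def cuthill_mckee(adj):
    """Classic Cuthill-McKee ordering.

    adj maps each node to the set of its neighbors.  Returns a list
    'order' where order[k] is the node that receives label k + 1.
    If the graph has several components, labeling restarts at an
    unlabeled node of minimum degree."""
    order, seen = [], set()
    # candidate start nodes, lowest degree first (ties broken by node id)
    for start in sorted(adj, key=lambda v: (len(adj[v]), v)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # label v's unlabeled neighbors in increasing degree order
            for u in sorted(adj[v] - seen, key=lambda u: (len(adj[u]), u)):
                seen.add(u)
                queue.append(u)
    return order

# Star graph: a minimum-degree leaf is labeled first.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
order = cuthill_mckee(star)
```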
In essence, given a labeled node i in the graph, this algorithm labels each of its neighbors
with a number j as close to i as possible. The edge connecting these two nodes becomes Aij
in A , and such a nodal ordering keeps |i − j| small. It is easy to see that an upper bound for the
bandwidth of A is 2m − 1 , where m is the maximum number of nodes per level of the
spanning tree generated by the algorithm.
The Cuthill-McKee Algorithm reduces, but does not necessarily minimize bandwidth. For
a specific family of matrices that share certain special properties, it is possible to devise an
algorithm to reduce bandwidth beyond Cuthill-McKee’s ability. Moreover, even for the same
graph, Cuthill-McKee can yield several matrices of different bandwidths, depending on the
starting node and the ordering of equal-degree nodes. See Figure 3 for an example.
Nonetheless, Cuthill-McKee ordering generally provides a significant bandwidth improvement
over natural ordering, applies to a wide range of problems, and can be easily automated. Due
to these benefits, the algorithm is widely used in scientific computing.
While the Cuthill-McKee Algorithm is well-known for its ability to reduce the bandwidth,
many preconditioners implement a more popular variation called the Reverse Cuthill-McKee
(RCM) Algorithm [11]. As its name suggests, this variation uses the same ordering pattern but
assigns numbers backwards.
Given a graph G
Select a node of minimum degree, label it n
When k nodes have been labeled, 1 ≤ k < n ,
Select the largest i such that node i has unlabeled neighbors
Locate all of node i ’s unlabeled neighbors ( u1 ,..., um )
In increasing degree order, label these nodes n − k ,..., n − k − m + 1
Repeat until all n nodes in G have been labeled.
Although the algorithms seem very similar, the original Cuthill-McKee and the Reverse
Cuthill-McKee behave rather differently. We shall explore their differences in a later section.
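In practice, rather than running the reversed rules directly, RCM is usually obtained by computing the classic Cuthill-McKee ordering and reversing it, since the node labeled k under CM receives label n + 1 − k under RCM. A trivial sketch (the CM ordering shown is hypothetical example data):

```python
def reverse_order(cm_order):
    """Reverse a Cuthill-McKee ordering to obtain RCM:
    the node labeled k becomes node n + 1 - k."""
    return cm_order[::-1]

cm_order = [3, 1, 4, 0, 2]    # example CM ordering (node ids, labels 1..5)
rcm_order = reverse_order(cm_order)
```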
Chapter 3 Problem Description
In the 2002 SIAM Review article “Effects of Ordering Strategies and Programming Paradigms
on Sparse Matrix Computations,” Oliker, Li, Husbands, and Biswas examine how nodal
ordering strategies affect the performance of sparse matrix computations.
This sparked our interest in the relationship between ILU preconditioners and nodal
reordering strategies. Which ILU works best? Which ordering strategy can improve
preconditioning quality? Do different ILU preconditioners perform well with different
reordering schemes? We want to search for efficient methods for solving finite element linear
systems by studying the combined behavior of nodal reordering schemes and ILU
preconditioners.
First, we propose a numerical study for this problem using Matlab. The strengths and
weaknesses of structure-based ILU(p) and threshold-based ILUT are compared using a series of
examples. Next, the classic Cuthill-McKee and Reverse Cuthill-McKee algorithms are analyzed.
Then, a detailed numerical study is conducted involving multiple meshes, reordering strategies,
and preconditioners. We seek a trend in reordering-preconditioner pairs that most
efficiently simplify the solution of our linear systems.
Chapter 4 Numerical Experiments
The following numerical experiments are carried out with Matlab 7.0 (R14) on a Windows
XP machine with Intel® Pentium® 4 2.02 GHz processor and 1GB of RAM.
The meshes we use are listed below. Many of them share the same domain at several
different mesh densities. Figure 4 illustrates the coarsest and the most refined mesh for each
example. The number of mesh points indicates the problem size; the larger the problem, the
better the algorithms needed to speed up calculations.
4.2 ILU(0) and ILUT
Matlab implements incomplete LU factorization in two ways: the threshold-based ILUT
method and ILU(0), the special case of the structure-based ILU(p) methods [21]. To use the
ILUT method, the user would call luinc( A , t ), with t being the tolerance, or threshold.
Factors of matrix A are computed, and entries smaller than t times the 2-norm of the column
vectors of A are dropped. Specifying zero tolerance produces factors with no dropped entries;
that is, luinc( A , 0) gives the same result as lu( A ). Another special case with tolerance being 1
drops all entries except those on the main diagonal. On the other hand, calling luinc( A , ‘0’)
results in the structure-based ILU(0). The L and U factors from luinc( A , ‘0’) have nonzero
entries only at locations where A has nonzero entries.
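The drop rule used by luinc( A , t ) can be illustrated on a single column. This sketch models only the dropping criterion, not the factorization itself (and in the real algorithm the diagonal entry is always kept, which is not modeled here):

```python
def drop_entries(col, t):
    """Drop entries smaller in magnitude than t times the column's 2-norm,
    mirroring the luinc(A, t) drop rule on a single column of values."""
    norm = sum(x * x for x in col) ** 0.5
    return [x if abs(x) >= t * norm else 0.0 for x in col]

# With t = 0.1 the small entry 0.1 falls below 0.1 * ||col|| and is dropped.
col = [3.0, 4.0, 0.1]
dropped = drop_entries(col, 0.1)
```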
However, the other cases of structure-based ILU(p), where p ≥ 1 , are unavailable. Not
only does Matlab lack an implementation of them, many researchers also neglect to mention any
ILU(p) beyond ILU(0). In some of the literature, it appears as if ILU(0) and ILUT are the only
LU-based preconditioners of importance. The reason is not obvious, but for now we shall
use the tools readily available to learn more about the two classes of ILU.
4.2.1 ILU(0)
Before we make comparisons, let us review the behavior of ILU(0). Its most obvious
property is the predictable sparsity pattern: L or U has a nonzero entry if and only if A has
a nonzero entry at the same location. On the other hand, while ILUT can produce much denser
or sparser factors, the nonzero elements in the factors do not necessarily overlap those in A .
Figure 5 shows the U factor of ILU(0) and ILUT at three thresholds, superimposed over the
same matrix. Only ILU(0) preserves the original matrix’s sparsity pattern.
This feature can be very important, especially if memory is limited. Given any square
matrix, we know a priori exactly how much memory is required to store its ILU(0) factors.
Moreover, because the method allows no fill-in, the dependency among the rows stays constant.
This simplifies parallel implementations, as discussed in the next section.
The danger of ILU(0), or of any structure-based ILU(p) method, is its insensitivity to the
magnitude of fill-ins. Potential LU elements are dropped based only on their locations, not
their values. Therefore, many small, insignificant elements may be preserved while large ones
are dropped. According to Karypis and Kumar, this can cause preconditioners to be ineffective
for matrices arising in many realistic applications [16].
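To make the structure-based behavior concrete, the following is a compact dense-matrix sketch of ILU(0) in the standard ikj formulation, suppressing fill-in by updating only positions where A is nonzero (an illustrative sketch, not the Matlab luinc implementation):

```python
def ilu0(A):
    """ILU(0) of a dense matrix stored as nested lists.

    Returns one matrix F holding L (unit lower triangular, below the
    diagonal) and U (upper triangular, including the diagonal).
    Fill-in is suppressed by updating only positions in A's pattern."""
    n = len(A)
    F = [row[:] for row in A]
    nz = {(i, j) for i in range(n) for j in range(n) if A[i][j] != 0.0}
    for k in range(n):
        for i in range(k + 1, n):
            if (i, k) in nz:
                F[i][k] /= F[k][k]           # multiplier L[i][k]
                for j in range(k + 1, n):
                    if (i, j) in nz:         # skip: would be fill-in
                        F[i][j] -= F[i][k] * F[k][j]
    return F
```

When A's pattern admits no fill-in, ILU(0) coincides with the complete LU factorization, which makes small cases easy to check by hand.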
4.2.2 ILUT
The sparsity pattern of the ILUT factors is unpredictable. We study the effect of the
threshold parameter through numerical experiments. Four matrices are chosen and
preconditioned with ILUT, using a wide range of threshold values between 0 and 1. The
condition number of the preconditioned system K(U^-1 L^-1 A) and the sparsity of the factors
( L + U − I ) are observed. Because these two values trade off against each
other, we compare them on the same graph. The values are plotted in Figure 6, where the
log of the condition number is used to enhance the contrast among the smaller values.
Common to all four graphs, the precondition number and the number of
nonzero elements are roughly inversely related. However, less sparse factors do not
always imply a higher preconditioning quality. The local minima on the precondition curves
indicate the best precondition numbers among all ILUT factors of similar sizes. To solve a
large linear system most efficiently, it would be in the user’s interest to take advantage of such
local minima. Unfortunately, in practice they cannot be found without excessive
computation.
When the precondition number approaches 1, the ILUT factor sizes approach those of a
complete LU factorization and become extremely memory-inefficient. On the other hand, as the
memory requirement is minimized ( t → 1 ), the precondition number becomes enormous
(sometimes even larger than K(A) ). As discussed in Chapter 2, the ILUT preconditioner with
t = 1 is identical to the Jacobi (diagonal) preconditioner. This experiment shows that such
preconditioning offers little help when the matrix is not close to diagonal. As a result of this
study, we see that ILUT is ineffective when the threshold is near either extreme.
4.2.3 Comparisons
In this section, we investigate how the structure-based factorizations differ from the
threshold-based versions in practice. We want to find out how their preconditioning qualities
compare. The answer is revealed by simply adding the statistics from ILU(0) in Figure 6.
On all four graphs, the blue circle indicates the log of the ILU(0) precondition number,
and the red circle indicates the number of nonzero elements in the ILU(0) factors. The blue
circle lying to the left of the red circle suggests that, to achieve the same precondition
number, ILUT generates sparser factors. In other words, when the threshold is chosen so
that the ILUT factors are as sparse as ILU(0)’s, the ILUT precondition number is much lower.
In all four cases in this experiment, our findings show that ILUT is a more efficient
preconditioner than ILU(0): less memory and a smaller precondition number.
Although our findings here do not favor ILU(0), its performance appears to
approach that of ILUT as the matrix size increases. When the matrix size is in the hundreds,
and assuming similar factor sizes, ILUT produces a precondition number that is roughly 1/8 of
the ILU(0) precondition number. When the matrix size goes over ten thousand, that ratio
increases to roughly 1/4. One may wonder whether, as the matrix size grows, this ratio
approaches or even exceeds 1, in which case ILU(0) would become the better preconditioner.
Unfortunately, since calculating the exact precondition number K(U^-1 L^-1 A) is impractically
resource-consuming, we are unable to carry out the same experiment on much larger matrices.
Now, if we suppose that ILU(0)’s weak performance stems from its lack of attention to element
magnitudes (as mentioned above), then we have a hypothesis for why structure-based ILU(p)
with p > 0 is not used in practice. While still possessing the same structure-based magnitude
insensitivity that makes it a weak preconditioner, it loses the clean and obvious sparsity pattern
that ILU(0) has. Therefore, ILU(p) with p > 0 has little value beyond theoretical discussions.
Suppose we have matrix F , generated from a finite element mesh with the Cuthill-McKee
reordering, and matrix R , generated from the same mesh with the Reverse Cuthill-McKee
reordering. Casually examining the sparsity patterns of the two matrices, we find that they
are mirror images of each other about the antidiagonal. Moreover, they have the same
bandwidth. Their only difference seems to be the alignment of the nonzero entries, as
illustrated in Figure 7.
In the Cuthill-McKee Algorithm, from a starting node i , its unassigned neighbors are given
numbers j , j+1 , j+2 , ... with j > i . Hence, in matrix F , many nonzero entries line up
in the pattern Fi,j , Fi,j+1 , Fi,j+2 , ... with j > i . In Reverse Cuthill-McKee, on the other
hand, unassigned neighbors of node j are given numbers i , i−1 , i−2 , ... with i < j .
Hence, in matrix R , many nonzero entries line up in the pattern Ri,j , Ri−1,j , Ri−2,j , ...
with i < j . In other words, looking at the upper triangular half of the matrices,
the nonzero entries in F line up horizontally, whereas the nonzero entries in R line up vertically
(reversed in the lower triangle).
Recall that during elimination, fill-in can appear only beneath or to the right of existing
nonzero entries: at upper-triangular positions with no nonzero entries above, or
lower-triangular positions with no nonzero entries to the left, no fill-in
would ever occur. In other words, fill-ins always happen within the “umbrella region.”
Due to the nonzero alignment patterns, it is easy to see in Figure 8 that matrix F has a
much larger “umbrella region” than matrix R . Hence, more potential fill-ins would be
eliminated during ILU(0), or could be, during ILUT. Intuitively, the more fill-ins we eliminate,
the more our incomplete factors differ from the true factors. With less accurate factors, the
ILU preconditioner produces a less ideal (higher condition number) preconditioned matrix.
Next, we would like to see the effects of nodal reordering on the ILU preconditioners.
Renumbering the nodes in a finite element mesh is equivalent to permuting the equations and
unknowns of the corresponding linear system. Although the system as a whole and its solutions
stay intact, the physical structure changes and preconditioning qualities could be affected.
We take the four matrices from the previous experiment (Section 4.2.2, Figure 6) and
reorder/permute them using the Reverse Cuthill-McKee algorithm. The ILUT preconditioner is
applied with the same threshold values as in the previous experiment, and the precondition
numbers and sparsity numbers are recorded accordingly. These numbers of the original (natural
ordering) matrices are subtracted from those of the new (RCM ordering) matrices. The
differences are plotted in Figure 9.
When the threshold is large, RCM ordering does not seem to benefit ILUT preconditioning.
The precondition numbers differ dramatically without a fixed direction or pattern. At the same
time, the sparsities of the factors are barely affected, if at all. Therefore, when we use ILUT
with large thresholds, the RCM reordering algorithm has an unpredictable effect on the
preconditioning quality while saving no memory. In this case, the reordering is an unnecessary
waste of effort.
On the other hand, RCM ordering does seem to improve ILUT with small thresholds. The
difference in precondition numbers converges to zero as the threshold decreases, while the
difference in sparsity grows concurrently: RCM-ordered matrices are preconditioned to
the same quality with smaller ILUT factors. However, based on our experiment, the amount of
sparsity gained is not proportional to the size of the matrix or the size of the threshold.
Comparing graphs (b) and (c) in Figure 9, matrix (c) before preconditioning is almost double the
size of matrix (b), but the ILUT memory it saves with RCM ordering is less than half when the
threshold is small. In addition, graph (d) suggests that for this particular matrix, the memory
efficiency gained from RCM is maximized when the threshold is slightly less than 10^-3, and drops
back to zero as the threshold continues to decrease. In essence, RCM ordering could improve the
preconditioning quality of ILUT, although the result is not guaranteed. We say “could”
because the red curve does not really start to drop until the threshold falls below 10^-2, at
which point the ILUT factors are larger than the ILU(0) factors. The sparsity difference
becomes significant only when the ILUT factors have become quite dense. Since the purpose
of incomplete LU is to keep the factors sparse, it would probably not be sensible to employ
thresholds small enough to see the difference between these two ordering strategies.
Recall that our real goal in preconditioning is to speed up the convergence of iterative
solvers, and the precondition number is only a rough indication of this. The actual effectiveness
of preconditioners and reordering strategies needs to be tested by actually solving the system with
an iterative solver. Tables 1.1~1.22 list the condition number, sparsity, and iterations to
convergence for our matrices (meshes described in Section 4.1), each with natural ordering,
Cuthill-McKee ordering, and Reverse Cuthill-McKee ordering. The preconditioners applied are
ILU(0), Jacobi, and ILUT at 18 other threshold levels. The iterative solver used is Matlab’s
implementation of GMRES.
For each preconditioner in each table, the best (i.e., smallest) values are highlighted. At a
quick glance, we find the counterintuitive fact that sometimes a smaller condition number is
associated with slower convergence! For example, in Table 1.3 (two_hole_0) with
ILUT(5e-2), RCM produces the sparsest matrix with the smallest condition number, but the
solver took two more iterations to converge than with the natural or CM orderings. In
general, however, small condition numbers still lead to faster convergence. This shows that
the precondition number is a good predictor of algorithm performance, but the best
predictor is to actually test it in the iterative solver.
For the ILU(0) preconditioner, RCM reordering consistently produces the best results: the
lowest condition number and, more importantly, the fastest convergence. RCM-ordered matrices
converge in up to 27% fewer steps than naturally ordered matrices. On the other hand, CM tends
to be a very poor ordering for ILU(0), giving significantly worse numbers than natural ordering.
From our structural analysis, recall that the classic CM ordering generates a much larger
“umbrella region” than the original ordering, in which ILU(0) eliminates many more potential
fill-ins and pushes its factors further from the complete LU factors. RCM, with a smaller
“umbrella region,” has an ILU(0) factorization much closer to complete LU.
In contrast, the classic CM ordering seems to suit ILUT fairly well. Especially when the
matrix becomes large and the threshold is relatively small, it yields a better convergence rate than
the natural or RCM orderings. It is interesting to observe that such CM-ordered matrices often
have the largest precondition numbers, yet they still converge the fastest.
In a nutshell, we learn that the usefulness of each ordering strategy depends entirely on how
it is used. Although nothing is absolute, we do observe a general pattern of best-matching
ordering strategies, preconditioners, and problem sizes. Reverse Cuthill-McKee is consistently
the best ordering scheme for ILU(0), classic Cuthill-McKee ordering works well for ILUT with
moderate to small thresholds on large systems, and natural ordering should suffice by itself on
small problems.
The Cuthill-McKee algorithm and its reversed version are essentially two special cases of
breadth-first search (BFS), with some special requirements. Since the algorithm was originally
developed with bandwidth reduction in mind, not preconditioning quality, one may wonder if
there exist other BFS-based ordering schemes that better suit our interest.
Given a graph G , suppose the scheme assigns numbers low to high [high to low]
Select a node and label it 1 [ n ]
When k nodes have been labeled, 1 ≤ k < n ,
Select the smallest [largest] i such that node i has unlabeled neighbors
Locate all of node i ’s unlabeled neighbors ( u1 ,..., um )
In specified sorting order, label these nodes k + 1,..., k + m [ n − k ,..., n − k − m + 1 ]
Repeat until all n nodes in G have been labeled.
We devise a test consisting of 14 BFS-based ordering schemes, each with some unique
requirements. There are seven ways to traverse the mesh by sorting a node’s neighboring nodes
differently. With each traversal, one scheme assigns numbers from low to high while another
goes from high to low. CM and RCM are included among these. Below is a table listing the
schemes used in our test.
Scheme    Neighbor sorting criterion                                              Numbering direction
Test07    Physical distance of the neighboring nodes from node i, descending      Low to high
Test08    Physical distance of the neighboring nodes from node i, descending      High to low
Test09    Physical distance of the neighboring nodes from node i, ascending       Low to high
Test10    Physical distance of the neighboring nodes from node i, ascending       High to low
Test11*   For neighboring node j, the value of Aij, descending                    Low to high
Test12*   For neighboring node j, the value of Aij, descending                    High to low
Test13*   For neighboring node j, the value of Aij, ascending                     Low to high
Test14*   For neighboring node j, the value of Aij, ascending                     High to low
On this list, Test01 is the most generic version of breadth-first search, so the new ordering
depends heavily on the existing natural ordering. Test03 and Test06 are the classic CM and the
popular RCM, listed here for comparison purposes. The rest are simple modifications of
existing ideas, and they represent only a small fraction of the possible breadth-first search
variants.
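All fourteen schemes fit one parameterized traversal: a BFS with a pluggable neighbor-sorting key and a numbering direction. A Python sketch (the graph shown is a hypothetical example, and a connected mesh is assumed):

```python
from collections import deque

def bfs_ordering(adj, start, key, reverse_labels=False):
    """Generic BFS-based ordering scheme for a connected graph.

    adj maps node -> set of neighbors.  'key' sorts the unlabeled
    neighbors of each visited node (e.g. degree for CM-like schemes,
    physical distance or matrix-entry magnitude for the Test* schemes).
    With reverse_labels=True, numbers are assigned high to low."""
    order, seen = [], {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in sorted(adj[v] - seen, key=key):
            seen.add(u)
            queue.append(u)
    return order[::-1] if reverse_labels else order

# A degree-based key gives a CM-like traversal (Test03);
# reverse_labels=True gives the RCM-like variant (Test06).
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
cm_like = bfs_ordering(adj, 0, key=lambda u: len(adj[u]))
```

Other schemes in the table drop in by changing only the key, e.g. `key=lambda u: -abs(A[v][u])` for a magnitude-based variant.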
Test11 through Test14 are not finite element mesh orderings in the same sense as the others.
Because they order nodes based on the magnitudes of matrix A ’s entries, the orderings are not
available before the matrix is built from the mesh. They can be achieved by matrix
permutations and have a positive effect on ILU preconditioning. However, on computers that
permute by moving physical entries (such as distributed-data machines), the cost of such an
operation could become prohibitive. Setting that cost aside, rearranging a matrix according to
the actual magnitudes of its elements can be practically sensible. If deemed useful, these
schemes can still be applied to time-dependent problems, where the cost of building and
permuting the matrix is repaid by higher efficiency at each time step of solving the problem.
We pick three meshes from each of our four domains, apply the test ordering schemes to
them, precondition using ILU(0) and nine thresholds of ILUT, then record their GMRES
convergence iterations. These numbers are recorded in Tables 2.1 ~ 2.4.
The purpose of this experiment is to find better ordering schemes than what we already
have from the previous experiment. Therefore, we use bold borders for columns A, 03, and 06,
which are natural, CM, and RCM orderings. For each row, the smallest iteration number
among the three orderings is determined. Then, any scheme that produces the same or better
iteration is highlighted green or yellow, respectively. In other words, a green entry means that
the ordering is as good as the best among natural, CM, and RCM; a yellow entry means that it is
better than all of the three. When a column has many green and yellow entries, that particular
ordering is probably what we are looking for.
First, we examine the ILU(0) case. RCM is still the best among the orderings, with only one
exception: Test12 consistently produces a matrix of the same quality as RCM, and sometimes even
better. This permutation scheme shares with RCM the property of producing a small
“umbrella region,” because it assigns numbers backwards (high to low). Moreover, because it
arranges the largest elements of A close to the main diagonal, it minimizes the overall
magnitudes of the eliminated fill-ins. While ILU(0) performs its structure-based incomplete
factorization, the Test12 permutation gives it some of the threshold-based advantages. Therefore,
it is stronger than the popular RCM ordering.
Next, we look at the ILUT cases. It seems difficult to compare all of the columns at
first glance, although it is apparent that some of our new ordering schemes are highly
comparable to, or even better than, natural, CM, and RCM. When the threshold is
small, the ordering schemes that assign numbers forward perform better than their backward
equivalents. However, the distinction is less obvious when the threshold is large. After a more
detailed examination, we find that Test05, Test10, Test11, and Test13 have the best overall
performance. Setting aside Test11 and Test13, which are not true mesh reordering schemes,
there are still two very satisfying results. Test05 is merely a modified Cuthill-McKee algorithm,
with nodes sorted in descending degree order rather than ascending. Test10, on the other hand,
requires a slightly more sophisticated implementation and is more computationally expensive
due to the calculation of the physical distances between nodes.
Our experiment finds some BFS-based nodal reordering schemes that generate better
numerical results than the existing Cuthill-McKee and Reverse Cuthill-McKee algorithms. Our
schemes not only help the ILU preconditioners and the GMRES iterative solver toward faster
convergence, they are also very easy to understand and implement. We also recognize that no
single scheme is perfect, so the choice should be made around the specific problem
that we want to solve and the preconditioner that we wish to use.
Chapter 5 The Parallel Case
This parallel ILU factorization algorithm is inseparable from its nodal reordering strategy.
The ordering is much more sophisticated than those in the single-processor cases, and the ILU
techniques account for only a small fraction of the algorithm. Although ILU(0) and ILUT are
both parallelizable based on this theory, the ILUT case is slightly more complicated.
Next, assume that we want to solve this problem on a concurrent computer with four
processors. Then it is not so straightforward. First of all, we need to partition the mesh into
four pieces. This partitioning process is a nontrivial field of study in itself. There are direct
k-way methods, recursive methods based on geometry or graph theory, and multi-level methods
[3, 9, 15, 23]. Efficient and robust partitioning algorithms often coarsen the mesh first, partition
using one of the simple methods, and then uncoarsen back to the original mesh with optimization
and local refinement at each step. Typically, these algorithms have two common goals:
1) partition the mesh into roughly equal sizes, so the processors have balanced
workloads, and 2) minimize the number of edges connecting two partitions, which minimizes the
amount of communication across the processors. We will not go into the details of these
algorithms, but shall assume we have an ideal partition of the mesh.
Figure 10 (b) shows an ideal partitioning of this mesh, with each partition holding exactly
12 nodes. Each of the four colors represents a partition, which is held on one processor. Still
under natural ordering, we color each row of matrix A with the color that corresponds to the
processor where it resides. When the colors interleave, we know that this ordering leads to an
inefficient ILU preconditioner. Consider row 22 (the first yellow row). To process this row,
A22,27 needs information from A19,27 , which lies on the red processor. And to process row 19,
A19,23 requires information from A15,23 , which lies on the blue processor. To factor A under
such a configuration, a large number of rows would have to be passed back and forth among the
processors at each step, making the factorization unbearably expensive.
One intuitive way to improve the situation is to use a simple nodal reordering to group
each processor’s rows together. For example, we can assign a new number to every node in one
processor before moving on to another, so that matrix A looks like Figure 10 (c). Now the
inter-processor communication is somewhat reduced, and factoring the rows in the first partition
requires no information from the other processors. However, while this ordering scheme can
benefit sequential algorithms, the rows’ dependence on each other still prevents ILU from
running concurrently on all processors.
First, classify the nodes in each processor as interior or interface nodes. Interior nodes are
those whose adjacent nodes all reside on the same processor, while interface nodes have adjacent
nodes on two or more processors. Figure 11 (a) highlights all the interface nodes. Assign new
numbers to the interior nodes, one processor at a time. When this is finished, there is an m < n
such that nodes k ≤ m are all interior nodes, and nodes k ≥ m + 1 are all interface nodes. In
our case, as illustrated in Figure 11 (b), m = 26 .
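The interior/interface classification follows directly from the adjacency structure and a partition map. A small sketch (the graph and partition are hypothetical example data):

```python
def classify_nodes(adj, part):
    """Split nodes into interior and interface sets.

    adj maps node -> set of neighbors; part maps node -> processor id.
    A node is interior when all of its neighbors live on the same
    processor as the node itself; otherwise it is an interface node."""
    interior, interface = set(), set()
    for v, nbrs in adj.items():
        if all(part[u] == part[v] for u in nbrs):
            interior.add(v)
        else:
            interface.add(v)
    return interior, interface

# Path 0-1-2-3 split across two processors: 1 and 2 sit on the cut.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
part = {0: 0, 1: 0, 2: 1, 3: 1}
interior, interface = classify_nodes(adj, part)
```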
The next step is to compute maximal independent sets from the remaining interface nodes,
denoted by AI . An independent set I of a graph G is a subgraph in which no two nodes
are adjacent. Maximal independent sets can be found using Luby’s algorithm [19] (Algorithm 9).
Once we have a maximal independent set I of AI , we assign new numbers, in order, to the
nodes in I . Afterwards, we set AI = AI \ I and I = ∅ , and the process repeats until AI is empty.
Figure 11 (c) ~ (f) illustrate the successive independent sets, along with their new nodal
numbering. Note that at each step, the nodes in I are numbered with regard to the order of the
processors.
Figure 11 (g) shows the original mesh with the new nodal numbers, and Figure 11 (h) is the
corresponding matrix A . At first glance, this matrix seems very poorly ordered. The
nonzero elements are spread out without a clear pattern, and the bandwidth is huge. On a single
processor, as we have already seen, such an arrangement yields poor conditioning. Also,
each processor holds largely disjoint rows beyond row m , which does not seem to improve on
the original ordering. However, the independence among these new rows makes this
seemingly messy matrix highly parallelizable.
To compute a maximal independent set I of a given graph G
G′ = G
While |G′| > 0
  For each node Gi in G′
    Label(Gi) = Random()
  End
  For each node Gi in G′
    If Label(Gi) < Label(Gj) for all Gj adjacent to Gi in G′
      I = I ∪ { Gi }
  End
  G′ = G′ \ I
  G′ = G′ \ { Gi ∈ G′ | ∃ Ik ∈ I such that adj(Gi, Ik) }
End
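Luby's randomized procedure translates almost line for line into Python (an illustrative sketch; the thesis implementation is in Matlab):

```python
import random

def luby_mis(adj, seed=0):
    """Maximal independent set via Luby's randomized algorithm.

    adj maps node -> set of neighbors.  Each round, every remaining
    node draws a random label; a node whose label is smaller than all
    of its remaining neighbors' labels joins I, after which it and its
    neighbors are removed from further consideration."""
    rng = random.Random(seed)
    remaining = set(adj)
    I = set()
    while remaining:
        label = {v: rng.random() for v in remaining}
        winners = {v for v in remaining
                   if all(label[v] < label[u]
                          for u in adj[v] & remaining)}
        I |= winners
        for v in winners:
            remaining.discard(v)
            remaining -= adj[v]   # neighbors of a winner can never join I
    return I
```

Each round removes at least the minimum-label node, so the loop terminates, and no two nodes of I are ever adjacent.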
The first and entirely parallelizable part of the ILU factorization of A is its interior rows.
As highlighted in Figure 12 (a) with vertical bars, different processors do not have nonzero
elements in the same column. In other words, the interior rows in one processor are completely
independent from interior rows in other processors. Therefore, all processors can factor their
own interior rows simultaneously. No waiting or inter-processor communication is required.
In fact, since each set of these rows can be viewed as an independent matrix, we can locally
enhance it with reordering strategies, such as those discussed in the previous section.
Then, as illustrated in Figure 12 (b) ~ (d), one independent set of interface rows is factored
at a time. While factoring these rows requires information from the above rows, some on
foreign processors, it can still be run in parallel. Because the interface rows within one
independent set do not depend on any row that is within the same independent set and on another
processor, there is no need for the processors to wait for each other. In the best case, all
processors would have an equal number of rows within each independent set, so they can
finish together, and no processor needs to wait for another before moving on to the next set.
Although this nodal ordering strategy can be used in parallel for structure-based ILU(p),
threshold-based ILUT, and even complete LU algorithms, slight differences apply to
each implementation. ILU(0) is the simplest, because it permits no fill-in whatsoever and
independent rows remain independent. ILUT is more complicated because its fill-ins introduce
new dependencies during factorization. When computing the independent sets, possible fill-ins
have to be taken into consideration to keep the sets truly independent. As a consequence, each
independent set could be smaller than that of an ordering for ILU(0).
Since the interior rows can be factored completely in parallel, it is in our best interest to
have as many of them as possible. When the mesh/matrix is relatively large compared to the
number of partitions/processors, most of the nodes/rows will be interior. The more the interior
nodes outnumber the interface nodes, the closer the factorization is to truly parallel. Increasing
the number of processors increases the speed of the factorization, but at the same time
reduces the parallelizability by increasing the percentage of interface rows. In any event, a
high-quality mesh partitioning algorithm is critical for the effectiveness of the parallel ILU.
Our partitioning method for square domains is completely location-based, and our goal is to
assign an equal number of nodes/rows to each processor. First, we draw a horizontal line across
the middle of the mesh. Then, we compare the numbers of nodes falling on either side of the
line, and move the line toward the smaller side. Eventually, the line bisects the mesh.
Repeatedly applying this process to the two halves of the mesh easily gives us 2^n equally
sized partitions. The quality of our partitioning is fairly high, and the process is easily
automated. The downside is that this process can be very costly. However, for large problems
arising from time-dependent nonlinear PDEs, this preprocessing cost is negligible compared to
the remaining calculations.
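As a rough illustration, the repeated line-moving process can be condensed into a recursive median split. This is a Python sketch under the assumption that nodes are given as (x, y) coordinate pairs; it is not the thesis's actual code.

```python
def bisect(nodes, axis):
    # sort along one coordinate and cut at the median: the position the
    # iteratively moved line converges to
    ordered = sorted(nodes, key=lambda p: p[axis])
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]

def partition(nodes, levels, axis=1):
    """Recursively bisect `nodes` into 2**levels equally sized pieces,
    alternating horizontal (axis=1) and vertical (axis=0) cuts."""
    if levels == 0:
        return [nodes]
    left, right = bisect(nodes, axis)
    return (partition(left, levels - 1, 1 - axis) +
            partition(right, levels - 1, 1 - axis))
```

With levels = 4, this yields the sixteen partitions used in the experiments below.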
We choose to partition three levels of refinement of the same domain, each into 2, 4, and 16
pieces, as graphed in Figure 13. The largest mesh (two_hole_4) has over 25 times more nodes
than the smallest mesh (two_hole_0). Because the nodes lie fairly evenly across the domain,
each of our partitions takes up nearly the same physical area. The two holes in the domain
cause the nodal bisection lines to shift; otherwise we would see perfect grids. On each mesh,
different colors indicate the different processors assigned to handle the nodes. Interior nodes
are labeled with circles and interface nodes with asterisks, though the symbols are not visible
on the largest mesh, where all nodes are crammed together.
We list in Tables 3.1 through 3.3 the number of total nodes and the number of interface nodes
in each of the nine graphs. Note that our algorithm distributes the nodes among the processors
as evenly as possible, although a perfectly even split is not always feasible. The percentage
of interface nodes is listed on the side. The magnitude of this number is the most critical
indication of parallel efficiency.
For each of the three meshes, the percentage of interface nodes increases as the number of
partitions increases. In particular, when the smallest mesh is distributed to 16 processors, well
over half of its total nodes are interface nodes! Despite the amount of concurrent computing
power available, the majority of our parallel ILU efforts would be wasted transferring data back
and forth among processors. It is very possible that 4 processors can factor this system faster
than 16 processors together.
Also, when there are more partitions, the number of interface nodes per partition varies
more. When the smallest mesh is partitioned into 4, the number of interface nodes per partition
ranges between 27 and 32. On the other hand, when the same mesh is partitioned into 16, the
range widens to between 6 and 23. Such a large difference among partitions is very
undesirable, because it means a great workload disparity among the processors at every step of
the parallel ILU algorithm.
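The per-partition counts reported in Tables 3.1 through 3.3 can be reproduced with a simple scan over the mesh graph. The sketch below (in Python, with illustrative names rather than the thesis's code) marks a node as interface whenever it has a neighbor owned by another processor.

```python
from collections import defaultdict

def interface_stats(owner, adj):
    """Return, per processor: (total nodes, interface nodes, % interface)."""
    totals, iface = defaultdict(int), defaultdict(int)
    for node, proc in owner.items():
        totals[proc] += 1
        # interface node: at least one neighbor lives on another processor
        if any(owner[n] != proc for n in adj.get(node, ())):
            iface[proc] += 1
    return {p: (totals[p], iface[p], 100.0 * iface[p] / totals[p])
            for p in totals}
```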
Another thing to observe is the inverse relationship between the mesh size and the percentage
of interface nodes. In all of the 2-, 4-, and 16-processor cases, the percentage drops by about
one-third from Table 3.1 to Table 3.2, and by another two-thirds from Table 3.2 to Table 3.3.
When the problem size is large, partitioning the mesh yields relatively few interface nodes and
makes it sensible to employ a large number of processors. In Table 3.3 with 16 processors, we
see that 83.85% of all nodes are interior and can be ILU-factored simultaneously without
inter-processor communication. This speeds up the factorization significantly and justifies the
use of many processors.
The main lesson from this experiment is that the efficiency of parallel ILU does not
necessarily increase with the number of parallel processors. We must consider the nature of the
required nodal reordering scheme, and account for the added dependencies among processors as the
mesh is split into smaller partitions. When we choose an appropriate level of parallelism, the
algorithm can run efficiently, nearly in true parallel.
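One way to make this trade-off concrete, though it is a simplification and not taken from the thesis, is an Amdahl-style estimate: treat the interior fraction f as perfectly parallel across p processors and the interface work as essentially serialized by inter-processor waits.

```python
def est_speedup(f, p):
    # f: fraction of rows that are interior (fully parallel)
    # p: number of processors
    return 1.0 / ((1.0 - f) + f / p)

# two_hole_4 on 16 processors (83.85% interior): about 4.7x
# two_hole_0 on 16 processors (34.70% interior): about 1.5x
```

Under this crude model, the largest mesh benefits substantially from 16 processors while the smallest barely gains, matching the observations above.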
Chapter 6 Conclusions
The aim of this thesis is to study nodal reordering strategies for finite element meshes that
speed up the convergence rate of an iterative solver. We start out by examining some classic
linear system solvers, iterative linear solvers, preconditioners, and reordering strategies.
Then, we proceed to experiment numerically with some of the schemes and methods, and to analyze
the strategies on a single processor and on multiple processors.
In the single-processor case, we first compare the classic Cuthill-McKee ordering and the
popular Reverse Cuthill-McKee. While they reduce the matrix’s bandwidth equally, they behave
very differently under preconditioners: RCM only works with structure-based ILU(0), and CM works
best with ILUT with small thresholds. Then, we examine a list of similar ordering strategies
based on the concept of Breadth-First Search, and find that some of them improve preconditioning
quality even more than the well-known CM and RCM.
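All of these orderings share one BFS skeleton. A minimal Python sketch of plain Cuthill-McKee (illustrative, not the thesis code; `adj` is an assumed adjacency-list representation of the mesh graph):

```python
from collections import deque

def cuthill_mckee(adj, start):
    """BFS from `start`, visiting each node's unvisited neighbors in
    order of increasing degree; reversing the result gives RCM."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for n in sorted(adj[v], key=lambda u: len(adj[u])):
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return order  # RCM: order[::-1]
```

Variants of this scheme typically differ in the choice of starting node and in how ties among neighbors are broken.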
In the multi-processor case, we learn that the parallel ILU algorithm is highly dependent on
a nontrivial ordering strategy. Whether or not the algorithm can run efficiently in parallel is
determined by the quality of the mesh partitioning, and the goal is to minimize the number of
interface nodes in each partition. Even assuming perfect partitions exist, the number of
partitions and the parallelizability are negatively related. Unless the problem size is large
enough, employing too many processors might actually slow down the computations.
Figures and Tables
Figure 2 Cuthill-McKee Ordering
This figure demonstrates the Cuthill-McKee algorithm’s ability to reduce bandwidth in matrices.
In the second matrix, the nonzero elements are rearranged close to the main diagonal, unlike
those scattered apart in the first matrix. The mesh should be viewed as a cylinder, where the
bottom gray row “wraps around” to the top row.
Figure 3 Cuthill-McKee Starting Node
To reach the smallest bandwidth using the Cuthill-McKee algorithm, sometimes we do not want to
start at a node of lowest degree.
Figure 4 Finite Element Meshes
two_hole_0, two_hole_4, four_hole_0, four_hole_3,
cross_dom_0, cross_dom_4, two_dom_0, two_dom_5
Figure 5 ILU(0) and ILUT
These figures compare the sparsity patterns of several ILU implementations. ILU(0) and ILUT
with three thresholds are applied to the same matrix. The incomplete factor U is superimposed
in red over the original matrix in blue.
ILU(0) ILUT(0.1)
ILUT(0.01) ILUT(0.001)
Figure 6 ILU Experiments
(a) 2d
t Precond nnz
ILUT
0.02 7.3877 529
0.01 3.2354 670
(b) two_hole_0
t Precond nnz
ILUT
0.02 12.249 3985
0.01 7.3715 4862
(c) two_hole_1
t Precond nnz
ILUT
0.02 17.477 6159
0.01 9.3389 7396
(d) two_hole_2
t Precond nnz
ILUT
0.02 41.508 11106
0.01 21.438 13761
Figure 7 CM and RCM
Natural ordering
Cuthill-McKee ordering, where nonzero entries line up “horizontally” in the upper triangle.
Reverse Cuthill-McKee ordering, where nonzero entries line up “vertically” in the upper triangle.
Figure 8 “Umbrella Regions”
Although the Cuthill-McKee and Reverse Cuthill-McKee algorithms are very similar, the matrices
they generate have dramatically different “umbrella regions.” Therefore, incomplete LU
factorizations behave rather differently on them.
Figure 9 Natural v. RCM Ordering on ILUT
Matrices from Figure 6 are reordered with the Reverse Cuthill-McKee ordering, and ILUT
preconditioning is applied with the same thresholds. The condition numbers and sparsity
numbers of the natural-ordering matrices are subtracted from those of the RCM-ordered matrices.
The differences are plotted below.
Figure 10 Mesh Partitioning
Figure 11 Mesh Partitioning for Parallel ILU
Panels (a) through (h)
Figure 12 Parallel ILU
Panels (a) through (d)
Figure 13 Mesh Partitioning Test 1
2 Processors 4 Processors 16 Processors
Figure 14 Mesh Partitioning Test 2
Panels (a) through (d)
Table 1 CM v. RCM and ILU(0) v. ILUT
Table 1.1 2d
Table 1.2 3d
Table 1.3 two_hole_0
Table 1.4 two_hole_1
Table 1.5 two_hole_2
Table 1.6 two_hole_3
Table 1.7 two_hole_4
Table 1.8 four_hole_0
Table 1.9 four_hole_1
Table 1.10 four_hole_2
Table 1.11 four_hole_3
Table 1.12 cross_dom_0
Table 1.13 cross_dom_1
Table 1.14 cross_dom_2
Table 1.15 cross_dom_3
Table 1.16 cross_dom_4
Table 1.17 two_dom_0
Table 1.18 two_dom_1
Table 1.19 two_dom_2
Table 1.20 two_dom_3
Table 1.21 two_dom_4
Table 1.22 two_dom_5
Table 2 15 Ordering Schemes and Their Effects on GMRES Iterations
Table 2.1
two_hole_2
ILU(0) 18 22 18 21 20 29 16 34 20 21 19 28 16 32 24
ILUT(1.0e-1) 21 23 21 22 22 28 26 33 29 22 21 26 22 30 22
ILUT(7.5e-2) 20 21 20 21 20 25 21 27 24 21 20 23 22 25 20
ILUT(5.0e-2) 17 16 16 16 16 20 19 20 22 16 15 18 20 17 19
ILUT(2.5e-2) 12 13 13 13 12 14 13 14 14 13 12 12 14 10 12
ILUT(1.0e-2) 10 9 10 10 9 9 10 8 10 10 9 9 10 9 9
ILUT(7.5e-3) 9 9 9 9 8 8 9 8 9 9 9 8 9 9 9
ILUT(5.0e-3) 8 7 7 8 7 7 8 7 8 8 8 7 8 7 7
ILUT(2.5e-3) 7 6 6 6 6 5 7 6 6 6 6 6 7 6 6
ILUT(1.0e-3) 5 5 5 5 5 4 5 4 5 5 5 5 5 4 5
two_hole_3
ILU(0) 25 32 29 31 30 45 24 47 26 27 24 37 23 46 34
ILUT(1.0e-1) 30 32 30 31 30 42 38 50 42 27 26 36 32 43 33
ILUT(7.5e-2) 29 30 30 29 27 39 30 40 34 27 26 35 30 35 28
ILUT(5.0e-2) 25 23 23 23 22 32 28 31 32 22 19 27 27 22 24
ILUT(2.5e-2) 19 18 18 18 18 21 19 26 19 19 18 17 20 14 18
ILUT(1.0e-2) 14 13 14 13 14 12 15 11 14 13 14 12 13 12 13
ILUT(7.5e-3) 13 12 13 11 13 11 13 11 12 12 13 12 13 12 12
ILUT(5.0e-3) 11 11 11 11 11 9 12 10 10 10 11 9 12 10 11
ILUT(2.5e-3) 10 9 10 9 9 7 10 7 9 8 9 8 10 8 9
ILUT(1.0e-3) 7 6 7 6 7 6 7 6 7 6 7 6 7 6 6
Table 2.2
four_hole_2
ILU(0) 14 16 13 16 15 21 12 24 15 16 13 19 11 23 18
ILUT(1.0e-1) 17 18 17 18 17 23 20 24 21 18 16 20 17 22 18
ILUT(7.5e-2) 14 15 14 15 15 20 17 21 19 16 15 17 16 19 16
ILUT(5.0e-2) 13 12 12 12 12 14 15 15 16 12 12 13 13 13 13
ILUT(2.5e-2) 9 9 9 10 9 11 10 11 11 10 9 10 10 9 10
ILUT(1.0e-2) 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
ILUT(7.5e-3) 7 7 6 7 6 6 6 6 6 7 6 6 7 6 6
ILUT(5.0e-3) 6 6 6 6 6 6 6 5 5 6 6 6 6 6 5
ILUT(2.5e-3) 5 5 5 5 5 4 5 4 5 5 5 5 5 4 5
ILUT(1.0e-3) 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4
four_hole_3
ILU(0) 20 22 19 22 21 29 17 35 21 22 20 27 17 33 26
ILUT(1.0e-1) 23 24 23 24 23 31 28 37 30 24 23 28 24 32 25
ILUT(7.5e-2) 22 21 21 22 20 27 23 28 26 22 21 23 23 25 22
ILUT(5.0e-2) 19 17 18 17 17 24 19 23 22 18 18 19 20 18 19
ILUT(2.5e-2) 14 14 13 14 13 15 14 18 15 14 13 14 13 12 14
ILUT(1.0e-2) 10 9 10 9 10 9 10 9 10 10 10 10 10 9 9
ILUT(7.5e-3) 9 9 9 9 9 9 9 8 9 9 9 9 9 9 9
ILUT(5.0e-3) 8 7 8 7 8 7 8 7 7 7 8 7 8 7 8
ILUT(2.5e-3) 7 6 7 6 7 6 7 5 6 6 7 6 7 6 6
ILUT(1.0e-3) 5 4 5 5 5 4 5 4 5 4 5 4 5 4 5
Table 2.3
Table 2.4
Table 3 Mesh Partitioning Tests for Parallel ILU
Table 3.1
two_hole_0
Processor    Nodes    Interface Nodes    % Interface Nodes
Two Processors
Processor 1 208 27 12.98%
Processor 2 207 31 14.98%
Total 415 58 13.98%
Four Processors
Processor 1 104 32 30.77%
Processor 2 104 27 25.96%
Processor 3 103 31 30.10%
Processor 4 104 29 27.88%
Total 415 119 28.67%
Sixteen Processors
Processor 1 26 6 23.08%
Processor 2 26 13 50.00%
Processor 3 26 18 69.23%
Processor 4 26 21 80.77%
Processor 5 26 20 76.92%
Processor 6 26 13 50.00%
Processor 7 26 22 84.62%
Processor 8 26 17 65.38%
Processor 9 26 17 65.38%
Processor 10 26 23 88.46%
Processor 11 25 15 60.00%
Processor 12 26 13 50.00%
Processor 13 26 22 84.62%
Processor 14 26 17 65.38%
Processor 15 26 18 69.23%
Processor 16 26 16 61.54%
Total 415 271 65.30%
Table 3.2
two_hole_2
Processor    Nodes    Interface Nodes    % Interface Nodes
Two Processors
Processor 1 558 44 7.89%
Processor 2 557 52 9.34%
Total 1115 96 8.61%
Four Processors
Processor 1 279 49 17.56%
Processor 2 279 45 16.13%
Processor 3 279 52 18.64%
Processor 4 278 46 16.55%
Total 1115 192 17.22%
Sixteen Processors
Processor 1 69 14 20.29%
Processor 2 70 31 44.29%
Processor 3 70 24 34.29%
Processor 4 70 39 55.71%
Processor 5 70 34 48.57%
Processor 6 69 28 40.58%
Processor 7 70 36 51.43%
Processor 8 70 27 38.57%
Processor 9 69 33 47.83%
Processor 10 70 36 51.43%
Processor 11 70 25 35.71%
Processor 12 70 34 48.57%
Processor 13 70 51 72.86%
Processor 14 69 34 49.28%
Processor 15 69 33 47.83%
Processor 16 70 21 30.00%
Total 1115 500 44.84%
Table 3.3
two_hole_4
Processor    Nodes    Interface Nodes    % Interface Nodes
Two Processors
Processor 1 5298 165 3.11%
Processor 2 5299 147 2.77%
Total 10597 312 2.94%
Four Processors
Processor 1 2649 193 7.29%
Processor 2 2649 121 4.57%
Processor 3 2649 121 4.57%
Processor 4 2650 179 6.75%
Total 10597 614 5.79%
Sixteen Processors
Processor 1 663 44 6.64%
Processor 2 662 103 15.56%
Processor 3 662 95 14.35%
Processor 4 662 144 21.75%
Processor 5 662 93 14.05%
Processor 6 663 86 12.97%
Processor 7 662 123 18.58%
Processor 8 662 112 16.92%
Processor 9 662 112 16.92%
Processor 10 663 120 18.10%
Processor 11 662 73 11.03%
Processor 12 662 113 17.07%
Processor 13 663 203 30.62%
Processor 14 662 100 15.11%
Processor 15 663 115 17.35%
Processor 16 662 75 11.33%
Total 10597 1711 16.15%
References
[1] W. Arnoldi. "The Principle of Minimized Iterations in the Solution of the Matrix
Eigenvalue Problem." Quarterly of Applied Mathematics. Vol. 9 (1951), pp. 17-29.
[2] S. Balay, W. Gropp, L. McInnes, and B. Smith. The Portable, Extensible Toolkit for
Scientific Computing (PETSc). Version 2.2.1, Code and Documentation, 2004.
Online. Available 4/18/2005: http://www.mcs.anl.gov/petsc.
[3] E. R. Barnes. “An Algorithm for Partitioning the Nodes of a Graph.” SIAM Journal on
Algebraic and Discrete Methods. Vol. 3 (1982), No. 4, pp. 541-550.
[4] S. C. Brenner and L. R. Scott. The Mathematical Theory of Finite Element Methods.
Second Edition. Springer. New York 2002.
[7] E. Cuthill and J. McKee. “Reducing the Bandwidth of Sparse Symmetric Matrices.”
Naval Ship Research and Development Center. ACM/CSC-ER Proceedings of the
1969 24th National Conference, pp. 157-172.
[8] R. Diestel. Graph Theory. Graduate Texts in Mathematics. Springer. New York 2000.
[10] K. A. Gallivan, A. Sameh, and Z. Zlatev. "A Parallel Hybrid Sparse Linear System
Solver." Computing Systems in Engineering. Vol. 1 (1990), pp. 183-195.
[11] A. George. "Computer Implementation of the Finite Element Method." Technical Report
STAN-CS-208, Stanford University, Stanford, CA, 1971.
[12] G. Havas and C. Ramsay. "Breadth-First Search and the Andrews-Curtis Conjecture."
International Journal of Algebra and Computation. Vol. 13 (2003), No. 1, pp. 61-68.
[13] D. Hysom and A. Pothen. “Level-based Incomplete LU Factorization: Graph Model and
Algorithms.” Submitted to SIAM Journal on Matrix Analysis and Applications.
November 2002.
[14] M. Jones and P. Plassmann. "Scalable Iterative Solution of Sparse Linear Systems."
Parallel Computing. Vol. 20 (1994), pp. 753-773.
[15] G. Karypis and V. Kumar. “A Fast and High Quality Multilevel Scheme for Partitioning
Irregular Graphs.” SIAM Journal on Scientific Computing. Vol. 20 (1998), No. 1,
pp. 359-392.
[18] C. T. Kelley. Iterative Methods for Linear and Nonlinear Equations. SIAM Frontiers in
Applied Mathematics. Philadelphia 1995.
[19] M. Luby. "A Simple Parallel Algorithm for the Maximal Independent Set Problem."
SIAM Journal on Computing. Vol. 15 (1986), pp. 1036-1053.
[21] MathWorks, Inc., The. MATLAB. Version 7.0 (R14). Code and Documentation, 2004.
[22] J. Meijerink and H. van der Vorst. "An Iterative Solution Method for Linear Systems of
Which the Coefficient Matrix is a Symmetric M-matrix." Mathematics of
Computation. Vol. 31 (1977), pp. 148-162.
[25] L. Oliker, X. Li, P. Husbands, and R. Biswas. “Effects of Ordering Strategies and
Programming Paradigms on Sparse Matrix Computations.” SIAM Review. Vol. 44
(2002), No. 3, pp. 373-393.
[26] A. Pothen and C. J. Fan. “Computing the Block Triangular Form of a Sparse Matrix.”
ACM Transactions on Mathematical Software, Vol. 16 (1990), No. 4, pp 303-324.
[27] Y. Saad. "ILUT: A Dual Threshold Incomplete LU Factorization." Numerical Linear
Algebra with Applications. Vol. 1 (1994), pp. 387-402.
[28] Y. Saad and M. Schultz. "GMRES: A Generalized Minimal Residual Algorithm for Solving
Nonsymmetric Linear Systems." SIAM Journal on Scientific and Statistical
Computing. Vol. 7 (1986), pp. 856-869.
[29] G. Strang and G. Fix. An Analysis of the Finite Element Method. Prentice Hall 1973.
[30] E. Weisstein, et al. "Jacobi Method." MathWorld - A Wolfram Web Resource. Online.
Available 4/19/2005: http://mathworld.wolfram.com/JacobiMethod.html
[33] J. Zhang. “A Multilevel Dual Reordering Strategy for Robust Incomplete LU Factorization
of Indefinite Matrices.” SIAM Journal on Matrix Analysis and Applications. Vol.
22 (2001), No. 3, pp. 925-947.
Vita
Peter S. Hou was born in Taipei, Taiwan on December 28th, 1981. He grew up like any
ordinary kid who loved watching cartoons and disliked math. However, as if fate had spoken, he
was chosen to be the math teacher’s assistant for five consecutive years. He moved to the
United States in 1997 and attended Langley High School in McLean, Virginia. His interest in
math grew as he participated in many math-related activities and brought home numerous awards.
In 2000, he was accepted into Virginia Tech to study Computer Science. One day he had a
strange feeling about a life without any more math classes, so he started to double-major
in Applied and Discrete Mathematics.
Outside of classes, he enjoyed the challenges from the Putnam Math Competition, Virginia
Tech Regional Math Contest, and Mathematical Contest in Modeling. During school, he
tutored part-time at the Math Emporium. Between semesters, he worked for ProfitScience,
LLC in McLean, Virginia as a software developer. He joined the Tae Kwon Do Club and the
Math Club, quickly became one of the leaders and has remained active to this day. In addition,
he served as a webmaster for the Class Program, helped prepare for the Ring Dance, and
participated in a number of other community activities.
In May 2004, one year after he was given the honor of Phi Beta Kappa membership, he
earned his two B.S. degrees in Computer Science and Mathematics, as well as a black belt in
Chung Do Kwan Tae Kwon Do. Right afterwards, he continued his studies at Virginia Tech as
a Master’s student in Mathematics. Under the Five-Year Bachelor-Master program and the
guidance of Dr. Jeff Borggaard, he will be completing his degree in May 2005. Following
his graduation, he is set to join Mercer Human Resources Consulting in New York City as an
actuary.