Nodal Reordering Strategies To Improve Preconditioning For Finite Element Systems
Peter S. Hou
Master of Science
in
Mathematics
APPROVED:
Peter S. Hou
Mathematics
ABSTRACT
The availability of high performance computing clusters has allowed scientists and
engineers to study more challenging problems. However, new algorithms need to be developed
to take advantage of the new computer architecture (in particular, distributed memory clusters).
Since the solution of linear systems still demands most of the computational effort in many
problems (such as the approximation of partial differential equation models), iterative methods
and, in particular, efficient preconditioners need to be developed.
ACKNOWLEDGEMENTS
I wish to express my deepest appreciation for my advisor, Dr. Jeff Borggaard. You
provided me with a research opportunity, guidance, and financial support. Without you, I
would not even have known what scientific computing is, or how fascinating it can be. When I
encountered difficulties during my research, your encouraging words always refueled my energy
and kept me motivated.
I would also like to thank Dr. Traian Iliescu and Dr. Serkan Gugercin for being on my
thesis committee. I understand that this is an extra task that you willingly took on amid
your busy schedules. Your expertise has led me further into exploring this field of
mathematics.
In addition, I must acknowledge the two people who shaped my life as a mathematician,
although they taught me nothing about advanced mathematics. Thank you for talking me into
becoming a math major, Dr. Lee Johnson. Your faith in me gave me the strength to come this far.
Oh, and I will never forget the torturous math training you imposed on me, Ms. Jing-Lan Fu.
I used to hate it so much, but how could I possibly have become good with numbers
otherwise?
Table of Contents
Acknowledgements
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Literature Overview
2.1 Finite Element Methods
2.2 Linear System Solvers
2.2.1 LU Decomposition
2.2.2 Iterative Solvers
2.2.2.1 Jacobi Method
2.2.2.2 Gauss-Seidel Method
2.2.2.3 Successive Over Relaxation (SOR)
2.2.2.4 Krylov Subspace Methods: Generalized Minimum Residual (GMRES)
2.3 Preconditioners
2.3.1 Jacobi Preconditioner
2.3.2 Incomplete LU (ILU) Factorization
2.3.2.1 Structure-Based ILU(ℓ)
2.3.2.2 Threshold-Based ILUT
2.4 Nodal Reordering Strategies for Finite Element Meshes
2.4.1 Cuthill-McKee Algorithm
2.4.2 Reverse Cuthill-McKee Algorithm (RCM)
Chapter 3 Problem Description
Chapter 4 Numerical Experiments
4.1 Finite Element Meshes
4.2 ILU(0) and ILUT
4.2.1 ILU(0)
4.2.2 ILUT
4.2.3 Comparisons
4.3 CM and RCM
4.3.1 The Structure
4.3.2 The Experiments
4.4 Breadth-First Search Orderings
Chapter 5 The Parallel Case
5.1 Onto a Parallel Computer
5.2 The Reordering Scheme
5.3 ILU Analysis
5.4 A Partitioning Test
5.5 Other Partitioning Considerations
Chapter 6 Conclusions
References
Vita
List of Figures
List of Tables
Chapter 1 Introduction
Many computational problems in science or engineering require the solution of large sparse
linear systems [6, 10, 14]. These systems have the form of finding an n-dimensional vector x
such that
Ax = b
where A is an n-by-n matrix and b is the n-dimensional right hand side vector. Due to the
challenge in solving these problems and their importance in real-world modeling and analysis, a
wide class of numerical algorithms has been developed to solve them. These algorithms are
very specialized, taking advantage of problem structure and computer architecture. A popular
class of algorithms is based on Krylov subspaces [18, 28]. These are iterative methods that,
under certain conditions (good problem conditioning, a good initial guess, appropriate parameter
tuning, etc.), are much more efficient than direct methods. They also lend themselves naturally
to parallel implementations. Thus, Krylov subspace methods are a popular choice in high
performance computing applications.
One of the limitations of the iterative methods is the condition number of the matrix A,

K(A) = ||A|| ||A^{-1}||,

where ||·|| represents one of the matrix norms (usually the 2-norm). There is a correlation
between the number of iterations the algorithm requires to converge (hence the computational cost)
and the magnitude of the condition number. The closer K(A) is to 1, the better. The notion
of left preconditioning is to premultiply the linear system above by a matrix P that is a good
approximation to A−1 ,
PAx = Pb ,
such that K ( PA) is closer to 1. The selection of a good preconditioner is critical to
developing a high performance algorithm. This is typically problem dependent, though a
number of popular strategies have emerged. Few of these, however, parallelize well.
Chapter 2 Literature Overview
This chapter introduces and examines some well-known techniques involved in solving the
linear systems of our interest on a computer. Section 2.1 briefly describes finite element
methods as the origin of our problem. Section 2.2 examines classic linear system solvers, and
leads into more efficient iterative solvers. Section 2.3 discusses preconditioners which
preprocess the linear systems to help iterative solvers converge faster. Lastly, Section 2.4
introduces finite element nodal reordering strategies that can potentially make preconditioners
more effective.
2.1 Finite Element Methods

The first step in a finite element method is to discretize the domain Ω of interest into a
finite element mesh, which is an undirected graph with nodes spaced across the domain. The
density of the nodes, or mesh points, may vary depending on the complexity of the subdomains.
Each mesh point is associated with an unknown and a basis function ϕ, which has value 1 at
that mesh point and value 0 at every other mesh point.
The weak form of the governing equation then leads to the system of equations

Σ_{j=1}^{n} [ ( ∫∫_Ω ∇ϕ_i(x, y) · ∇ϕ_j(x, y) dA ) u_j ] = ∫∫_Ω f(x, y) ϕ_i(x, y) dA.

Defining A_ij = ∫∫_Ω ∇ϕ_i(x, y) · ∇ϕ_j(x, y) dA, x_i = u_i, and b_i = ∫∫_Ω f(x, y) ϕ_i(x, y) dA, this can be
represented as a linear system Ax = b, where x is a column vector of unknown values at the
mesh points, while A and b correspond to the left- and right-hand sides of the equation.
From here, the complex physical problem has been reduced down to a standard system of
equations.
The finite element methods use basis functions ϕi with local support. As a means to
construct these, the problem domain Ω is partitioned into regular subdomains: e.g., intervals in
1-D, triangles or rectangles in 2-D, and tetrahedrons or bricks in 3-D. Nodes are placed on
vertices and perhaps edges, faces or interiors on which piecewise polynomial bases are generated.
While there is a natural assignment of unknown numbers in 1-D elements, there are many nodal
ordering choices in higher dimensions. As we discuss below, this ordering has a dramatic
impact on fill-in for direct linear system solvers, and this carries over to preconditioners based on
direct solvers. This is the main topic of this research.
2.2 Linear System Solvers

The most trivial and primitive method is to find the inverse of A, assuming it exists.
Direct methods such as Gaussian Elimination are easy to understand and implement, and they
produce exact inverses up to finite precision arithmetic. Subsequently, x = A−1 Ax = A−1b
gives us the solution of the system.
This method, however, is rarely used beyond an elementary linear algebra class. The reason
is simple: computing the inverse of a matrix is too expensive. Real-world problems can easily
have millions of equations with millions of unknowns. Computing the inverse of such a system
not only requires tremendous computational power, but can also take up an unrealistic amount of
memory. In addition, the process does not parallelize well, and hence cannot be sped up
efficiently even with multiple processors. Therefore, we introduce some linear system solvers
that can be implemented more effectively.
2.2.1 LU Decomposition
LU decomposition factors A into a unit lower triangular matrix L and an upper triangular
matrix U. For a 4-by-4 matrix:

[ A11 A12 A13 A14 ]   [  1    0    0    0 ] [ U11 U12 U13 U14 ]
[ A21 A22 A23 A24 ] = [ L21   1    0    0 ] [  0  U22 U23 U24 ]
[ A31 A32 A33 A34 ]   [ L31  L32   1    0 ] [  0   0  U33 U34 ]
[ A41 A42 A43 A44 ]   [ L41  L42  L43   1 ] [  0   0   0  U44 ]
One advantage of this factorization is the ability to store L and U in the same matrix to
conserve storage space. Algorithm 1 takes such advantage and factors matrix A in-place.
The output of this algorithm is in the form:
     [ U11 U12 U13 U14 ]
LU = [ L21 U22 U23 U24 ]
     [ L31 L32 U33 U34 ]
     [ L41 L42 L43 U44 ]
which, when necessary, can be easily broken into two separate matrices. Note that because the
values on the main diagonal of L are known, they need not be stored in LU.
For j = 1 to n-1
    For i = j+1 to n
        α = A_{i,j} / A_{j,j}
        For k = j+1 to n
            A_{i,k} = A_{i,k} - α A_{j,k}
        End
        A_{i,j} = α
    End
End
Algorithm 1: An LU decomposition
We see from line 3 in the algorithm that numerical accuracy may be at stake if any entry on
the main diagonal of A becomes very small during the factorization. To improve this
situation, we can apply partial pivoting: permute the rows of A so that the element of
maximum absolute value in each column lies on the main diagonal. This rearrangement of
equations does not affect the solution, as long as the same permutation is also applied to b.
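As a concrete illustration, Algorithm 1 (without pivoting) can be transcribed into Python with NumPy roughly as follows; the function name and the test matrix are ours, for illustration only:

```python
import numpy as np

def lu_in_place(A):
    """Algorithm 1: factor A so that the strict lower triangle holds L
    (unit diagonal implied) and the upper triangle holds U."""
    F = A.astype(float).copy()
    n = F.shape[0]
    for j in range(n - 1):
        for i in range(j + 1, n):
            alpha = F[i, j] / F[j, j]              # the multiplier L_{i,j}
            F[i, j + 1:] -= alpha * F[j, j + 1:]   # eliminate row i beyond column j
            F[i, j] = alpha                        # store L_{i,j} in the zeroed slot
    return F

# Recover L and U from the packed factor and verify LU = A.
A = np.array([[4., 3., 2., 1.],
              [3., 4., 3., 2.],
              [2., 3., 4., 3.],
              [1., 2., 3., 4.]])
F = lu_in_place(A)
L = np.tril(F, -1) + np.eye(4)
U = np.triu(F)
```

A forward and a backward substitution with these factors then solve Ax = b.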
After matrix A has been decomposed into the appropriate factors, we can substitute L
and U to solve the linear system Ax = b :
Ax = ( LU ) x = L(Ux ) = b
Let y = Ux. Note that Σ_{j=1}^{i} L_{ij} y_j = b_i for all 1 ≤ i ≤ n, so we can solve Ly = b recursively
using a forward substitution algorithm.
For i = 1 to n
    y_i = (b_i - Σ_{j=1}^{i-1} L_{ij} y_j) / L_{ii}    (here L_{ii} = 1)
End
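A minimal NumPy sketch of this forward substitution (function name and example system are ours):

```python
import numpy as np

def forward_substitute(L, b):
    """Solve L y = b by forward substitution (L lower triangular)."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        # y_i = (b_i - sum_{j<i} L_ij y_j) / L_ii; L_ii = 1 for unit triangular L
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

# Example: a unit lower triangular system.
L = np.array([[1., 0., 0.],
              [2., 1., 0.],
              [3., 4., 1.]])
b = np.array([1., 4., 14.])
y = forward_substitute(L, b)
```

Solving Ux = y with the analogous backward substitution then completes the solve.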
2.2.2 Iterative Solvers

In the case when computing the exact solution to a linear system is impossible or infeasible,
iterative solvers can be employed to numerically approximate the true solution x . Let x0 be
an initial guess of the solution x . The iterative solver computes a sequence {xk } , with
xk → x as k → ∞ . The residual vector rk = b − Axk is used to determine how close we are
to the true solution. Obviously, for a well-conditioned problem, rk ≈ 0 when xk is a good
approximation to x. An iterative solver stops when the residual becomes smaller than a
specified threshold, or when a certain number of iterations has been reached without
convergence. Good iterative solvers aim to make {x_k} converge quickly and to minimize the
norm of the residual vector.
2.2.2.1 Jacobi Method

Consider the splitting

A = L + D + U.
The matrices L and U are the strictly lower and upper triangular parts of A , and D is
the main diagonal. As we discussed above, inverting a triangular matrix can be performed by
forward or backward substitution. Inverting a diagonal matrix simply involves inverting the
diagonal entries. Hence, the inverses appearing below are computationally tractable.
Ax = ( L + D + U ) x = b .
Then, we move L and U to the right hand side to motivate the Jacobi iteration,
D x_{n+1} = b - (L + U) x_n
x_{n+1} = D^{-1} (b - (L + U) x_n).
The Jacobi method solves each equation in the linear system independently [30]. It solves
one variable xi at a time while assuming all other variables x remain fixed. It is extremely
parallelizable in nature. Unfortunately, while this simple idea is very easy to implement, it is
very unstable. It works well with strictly diagonally dominant matrices, but its convergence
is not guaranteed otherwise. The iteration matrix D^{-1}(L + U) must have all of its eigenvalues
inside the unit disk (the smaller, the better).
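A minimal sketch of the Jacobi iteration in NumPy (the function name, the test system, and the spectral-radius check are ours, for illustration):

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Jacobi iteration: x_{n+1} = D^{-1} (b - (L + U) x_n)."""
    d = np.diag(A)
    R = A - np.diag(d)                  # the off-diagonal part L + U
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_new = (b - R @ x) / d
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Strictly diagonally dominant system: convergence is guaranteed, i.e. the
# spectral radius of D^{-1}(L + U) is below 1.
A = np.array([[4., 1., 1.],
              [1., 5., 2.],
              [1., 2., 6.]])
b = np.array([6., 8., 9.])
rho = max(abs(np.linalg.eigvals(np.diag(1 / np.diag(A)) @ (A - np.diag(np.diag(A))))))
x = jacobi(A, b)
```

Note that each component of x_new can be computed independently, which is what makes the method so parallelizable.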
2.2.2.2 Gauss-Seidel Method

The Gauss-Seidel method instead keeps L together with D on the left-hand side:

(L + D) x_{n+1} = b - U x_n
x_{n+1} = (L + D)^{-1} (b - U x_n).
Note that ( L + D) is a lower triangular matrix, so the iterations can be computed using a
forward substitution (no matrix inversion is necessary). Due to this nature, the computations
are sequential: solving each equation requires the solutions from the previous equations.
Therefore, this algorithm is not parallelizable like the Jacobi method. However, it is relatively
more stable, and it is applicable to strictly diagonally dominant matrices and to symmetric
positive definite matrices.
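A sketch of a Gauss-Seidel sweep in NumPy (names and test system ours); note that each update immediately reuses the newest values, which is exactly the forward substitution with (L + D):

```python
import numpy as np

def gauss_seidel(A, b, tol=1e-10, max_iter=1000):
    """Gauss-Seidel: each x_i update uses the values already computed this sweep."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # x[:i] holds this sweep's new values, x[i+1:] last sweep's values
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

# Same strictly diagonally dominant test system as for Jacobi (ours).
A = np.array([[4., 1., 1.],
              [1., 5., 2.],
              [1., 2., 6.]])
b = np.array([6., 8., 9.])
x = gauss_seidel(A, b)
```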
6
2.2.2.3 Successive Over Relaxation (SOR)
SOR is derived by extrapolating the Gauss-Seidel method, taking a weighted average of the
Gauss-Seidel update and the previous iterate [31]:

(ωL + D) x_{n+1} = ωb - ωU x_n + (1 - ω) D x_n
x_{n+1} = (ωL + D)^{-1} (ωb - ωU x_n + (1 - ω) D x_n),    0 < ω < 2.
When ω is chosen properly, this method speeds up the convergence rate. The difficult task
is to choose a good value of ω for each specific problem. When ω = 1, SOR reduces to
Gauss-Seidel. Also, this method fails to converge if ω falls outside of (0, 2).
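The weighted-average form above can be sketched componentwise in NumPy (names and test system ours): each Gauss-Seidel update is blended with the previous value by the weight ω:

```python
import numpy as np

def sor(A, b, omega, tol=1e-10, max_iter=1000):
    """SOR: blend each Gauss-Seidel update with the previous iterate by omega."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            gs = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
            x[i] = (1 - omega) * x[i] + omega * gs   # weighted average
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

A = np.array([[4., 1., 1.],
              [1., 5., 2.],
              [1., 2., 6.]])
b = np.array([6., 8., 9.])
x = sor(A, b, omega=1.1)   # omega = 1 reproduces Gauss-Seidel exactly
```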
2.2.2.4 Krylov Subspace Methods: Generalized Minimum Residual (GMRES)

The Krylov subspace methods are a family of iterative solvers that, unlike the three above,
do not have an iteration matrix. Their implementations are based on the minimization of some
measure of error over the affine space x_0 + K_k at each iteration k, where x_0 is the initial
iterate and K_k is the k-th Krylov subspace,

K_k = span{ r_0, A r_0, ..., A^{k-1} r_0 },
where r0 = b − Ax0 is the initial residual vector.
Many variants of Krylov subspace methods exist, and they possess various strengths and
limitations. Well-known versions include the Conjugate Gradient Method, the Generalized
Conjugate Residual Method, and the Minimum Residual Method, whose applications are limited
to symmetric positive definite systems, non-symmetric positive definite systems, and symmetric
indefinite systems, respectively [18]. The most popular variant of Krylov subspace methods is
the Generalized Minimum Residual (GMRES) Method [28], due to its applicability to
non-symmetric indefinite systems. We introduce this method here and use it as our iterative
solver in the experiments.
An orthonormal basis {v_i} for this subspace is built one vector at a time by a Gram-Schmidt
process:

w_i = A v_i
For k = 1 to i
    w_i = w_i - <w_i, v_k> v_k
End
v_{i+1} = w_i / ||w_i||
When applied to the Krylov sequence, this method is known as the Arnoldi Algorithm [1].
The inner product coefficients <w_i, v_k> and the norms ||w_i|| are stored in an upper Hessenberg matrix.
Suppose we have generated the complete orthonormal basis V. Then we can represent the
solution as x = x_0 + Σ_{i=1}^{n} v_i y_i, where the v_i are the column vectors of V, and the scalars y_i
are chosen to minimize at each step the norm of the residual vector b - Ax = b - A(x_0 + Σ_{i=1}^{n} v_i y_i).
In other words, this algorithm always converges to the exact solution in at most n iterations,
provided exact arithmetic is used. In practice, however, this fact does not have much value.
When n is large, not only is the number of iterations unaffordable, but the storage required
for V and H also becomes prohibitive.
The restarted version of GMRES overcomes this problem. Given a natural number m ≤ n,
the algorithm stops and "restarts" after m iterations. The intermediate result
x_m = x_0 + Σ_{i=1}^{m} v_i y_i is used as the new x_0, V and H are cleared from memory, and the
whole process repeats from the beginning until convergence is achieved.
Choose x_0
Compute r_0 = b - A x_0;  v_1 = r_0 / ||r_0||;  β_1 = ||r_0||
For j = 1 to m
    Compute w_j = A v_j
    For i = 1 to j
        h_{i,j} = <w_j, v_i>
        w_j = w_j - h_{i,j} v_i
    End
    h_{j+1,j} = ||w_j||
    If h_{j+1,j} = 0 then m = j; exit for
    v_{j+1} = w_j / h_{j+1,j}
End
Define the (m+1) × m Hessenberg matrix H_m = (h_{i,j})
Compute y_m to minimize ||β_1 e_1 - H_m y||
x_m = x_0 + V_m y_m
The difficult task in the restarted version of GMRES is to choose an appropriate m .
When m is too small, the algorithm may converge very slowly or fail to converge. When m
is too large, excessive computations and storage make the process unnecessarily expensive.
Unfortunately, the optimal m depends entirely on each particular system, and there is no definite
rule for choosing this number.
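As an illustration, a minimal dense-matrix GMRES(m) can be written in NumPy as follows (function and variable names are ours; production codes work on sparse matrices and typically apply Givens rotations to H instead of a generic least-squares solve):

```python
import numpy as np

def gmres_restarted(A, b, m=20, tol=1e-10, max_restarts=100):
    """GMRES(m): run m Arnoldi steps, minimize ||beta e_1 - H y|| in the
    small least-squares problem, then restart from the improved iterate."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_restarts):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta < tol:
            break
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        k = m
        for j in range(m):
            w = A @ V[:, j]
            for i in range(j + 1):            # modified Gram-Schmidt
                H[i, j] = w @ V[:, i]
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] < 1e-14:           # "happy breakdown"
                k = j + 1
                break
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(k + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)
        x = x + V[:, :k] @ y
    return x

# Non-symmetric but well-conditioned test system (ours, for illustration).
rng = np.random.default_rng(0)
n = 30
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = gmres_restarted(A, b, m=10)
```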
2.3 Preconditioners
In nearly every practical example, iterative methods for the original linear system converge
too slowly. Analyses of these algorithms have found that there is a correlation between the
conditioning of the system matrix, measured by the condition number,
K(A) = ||A|| ||A^{-1}||,
and the number of iterations (work) required to converge [18]. Thus, the original system must
be preconditioned to improve algorithm performance. However, to be effective, the cost of
solving the preconditioned system (including the cost of preconditioning) should be less than the
cost of solving the original system. In fact, the reduction needs to be dramatic for iterative
methods to be effective.
In this study, we consider left preconditioners. Thus, we “premultiply” both sides of the
equation by a preconditioning matrix P ,
( PA) x = Pb .
The optimal preconditioner would be the inverse of A (although this would never be a practical
preconditioner). The resulting preconditioned matrix would have a condition number of 1 (the
smallest possible).
In general, the larger the condition number is, the harder it is to find a good approximate
inverse for the matrix. The base-b logarithm of K ( A) estimates how many base-b digits are
lost in solving a linear system with matrix A. The convergence of GMRES is bounded by

||r_k|| ≤ C ( (K(A) - 1) / (K(A) + 1) )^k ||r_0||,

where r_k is the k-th residual vector in GMRES [18]. Moreover, the accuracy of any iterative
solution is bounded by

||x - x_k|| / ||x|| ≤ K(A) ||r_k|| / ||b||,

where x is the true solution and x_k is the k-th approximation to x [5]. Therefore,
reducing the condition number of the system is important for both speed and accuracy of an
iterative solver.
Many preconditioners with different strengths and applications have been developed, and
we examine two: Jacobi Preconditioner [18] and the ILU Preconditioner family [13, 27]. They
are both based on modified versions of two linear system solvers. We shall see how impractical
solvers can be transformed into powerful preconditioners.
2.3.1 Jacobi Preconditioner

The Jacobi Preconditioner, also known as the Diagonal Preconditioner, is derived from the
Jacobi iterative method. It applies the inverse of the diagonal entries of A to both sides of the
equation, with the hope of reducing the condition number. If matrix A is diagonally
dominant, the inverse of its diagonal may be a good approximation to the inverse of A itself.
It works well on certain diagonally dominant matrices. Like the Jacobi method, this
preconditioning procedure is highly parallelizable but can be very unstable. Improved versions
such as Block Diagonal Preconditioning are available, but suffer from similar limitations [20].
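A small numerical illustration of this effect (the 4-by-4 test matrix and the variable names are ours): diagonal scaling can sharply reduce the condition number of a diagonally dominant matrix whose rows live on very different scales.

```python
import numpy as np

# Diagonally dominant matrix whose rows have widely varying scales.
A = np.diag([1., 10., 100., 1000.]) + 0.1 * np.ones((4, 4))
P = np.diag(1.0 / np.diag(A))       # Jacobi (diagonal) preconditioner
cond_A = np.linalg.cond(A)          # large, driven by the scale disparity
cond_PA = np.linalg.cond(P @ A)     # close to 1 after diagonal scaling
```

Here P A has unit diagonal and small off-diagonal entries, so it is close to the identity in exactly the sense discussed above.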
2.3.2 Incomplete LU (ILU) Factorization

Define a fill-in to be an initially zero entry in matrix A whose value becomes nonzero as a
result of the basic row operations in the LU factorization. When any Gaussian
Elimination-based algorithm is applied to a sparse matrix, many fill-ins take place and the
resulting product may become very dense. As the number of nonzero entries increases, so does
the memory requirement. When the problem size is large, it can easily become
unbearably expensive to store all the fill-in entries created by LU. Therefore, Incomplete LU
(ILU) Factorization was developed as a practical alternative [22].
A = LU ≈ L̃Ũ,   L̃ ≈ L,   Ũ ≈ U

The Ũ and L̃ produced by ILU are upper and lower triangular matrices "close" to the U
and L factors of A, less some or all of the fill-ins. See Figure 1 to visualize ILU's reduced
memory cost. A major decision in implementing an ILU factorization is to determine which
fill-ins to allow, and which to eliminate. Eliminating more fill-ins keeps the factors sparser,
saving memory space and computing power. On the other hand, allowing more fill-ins keeps
Ũ and L̃ "less different" from U and L, so that Ũ^{-1} L̃^{-1} stays closer to A^{-1}.

(Ũ^{-1} L̃^{-1} A) x = Ũ^{-1} L̃^{-1} b

Instead of being a linear system solver like complete LU, the ILU serves as a preconditioner.
With Ũ^{-1} L̃^{-1} "close" to A^{-1}, the preconditioned matrix Ũ^{-1} L̃^{-1} A would be "close" to the identity matrix. In other
words, it would have a smaller condition number. When an iterative solver is used to solve this
modified system, convergence would be reached faster with higher accuracy. Various ILU
implementations use different theories to eliminate fill-ins, trading off memory requirement for
conditioning quality and vice versa. We shall study the two major families of ILU algorithms:
the structure-based ILU(ℓ) [13] and the threshold-based ILUT [27].
2.3.2.1 Structure-Based ILU(ℓ)

The structure-based ILU(ℓ) implementations allow and deny each fill-in based on its
location relative to the structure of the matrix [13]. The first phase determines the locations of
permissible fill-in entries by assigning each location a level. A fill-in is allowed if its level is
less than or equal to ℓ. In the second phase, an LU factorization takes place, using the
"incomplete" fill-in pattern determined in the first phase to keep certain zero entries intact.
In Algorithm 4, the matrix Λ contains the level values for the entries of A. Each level
is a nonnegative integer, and Λ_ij ≤ ℓ indicates that a fill-in is allowed for A_ij in the
factorization. Entries of Λ are initially set to undefined, and some stay undefined if the
entry is not a possible fill-in (i.e., if the entire column above the entry is zero).
In essence, this algorithm works as follows: if an entry is initially nonzero in A , it has level
0 and no limit is imposed on that entry. If an entry is initially zero, then any possible fill-in at
this location depends on a nonzero entry to its left (the pivot in Gauss Elimination) and a nonzero
entry above it (the row whose multiple adds to this row). Each entry’s level depends on the
levels of the two entries that may be causing its fill-in. Successor entries are considered “less
important” than the predecessor entries and therefore have strictly higher levels. There are two
popular implementations for weighing a level based on its predecessor entries’ levels:
The max rule:

computeWeight(a, b)
{
    return(max{a, b} + 1);
}

The sum rule:

computeWeight(a, b)
{
    return(a + b + 1);
}
We will not compare the strengths of these two rules in this thesis. However, one can see
that the succession of levels grows faster under the sum rule. Therefore, ILU(ℓ)
allows fewer fill-in entries under the sum rule, given the same ℓ.
ILU(0) is a special case of the ILU(ℓ) family. Only entries with level 0, i.e. initially
nonzero entries, are allowed to be nonzero after the factorization. In other words, the factors
have the same sparsity pattern as the original matrix A. ILU(0) is a popular method since it is
intuitively predictable, and its factors require the minimum amount of memory among the
entire ILU(ℓ) family.
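A dense-matrix sketch of ILU(0) in Python (NumPy assumed; names ours, and real implementations work on sparse storage): the elimination of Algorithm 1 is run, but only positions that are nonzero in A are ever written.

```python
import numpy as np

def ilu0(A):
    """ILU(0): LU elimination restricted to the sparsity pattern of A,
    so every would-be fill-in is simply discarded."""
    F = A.astype(float).copy()
    nz = A != 0                      # the original sparsity pattern
    n = A.shape[0]
    for j in range(n - 1):
        for i in range(j + 1, n):
            if not nz[i, j]:
                continue             # this L entry would be a fill-in: skip it
            alpha = F[i, j] / F[j, j]
            F[i, j] = alpha
            for k in range(j + 1, n):
                if nz[i, k]:         # update only positions nonzero in A
                    F[i, k] -= alpha * F[j, k]
    return F

# For a tridiagonal matrix no fill-in occurs, so ILU(0) is the exact LU.
A = np.diag([4., 4., 4., 4.]) + np.diag([1., 1., 1.], 1) + np.diag([1., 1., 1.], -1)
F = ilu0(A)
L0 = np.tril(F, -1) + np.eye(4)
U0 = np.triu(F)
```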
2.3.2.2 Threshold-Based ILUT

Unlike ILU(ℓ), the threshold-based ILUT algorithms maintain the sparsity of a matrix by
controlling the magnitudes of its entries [27]. In the factors, the significance of a nonzero entry
no longer depends on its relative location, but on its absolute value. Imagine that we drop a
certain number of the smallest nonzero entries from L and U, the complete LU factors of A,
to produce L̃ and Ũ. These incomplete factors are sparser, yet still fairly similar to
the complete factors L and U.
In general, ILUT(t) drops any element whose magnitude is smaller than a threshold.
The threshold is often a number t in [0, 1] multiplied by the norm of the active row (or column,
if the algorithm is column-based) in the factorization process. In other words, an entry in row i
of the approximate factorization is replaced by zero if its absolute value is less than t* = t ||A_{i,*}||_2.
For j = 1 to n-1
    For i = j+1 to n
        t* = t ||A_{i,*}||_2
        α = A_{i,j} / A_{j,j}
        For k = j+1 to n
            A_{i,k} = A_{i,k} - α A_{j,k}
        End
        For k = i+1 to n
            If |A_{i,k}| < t* Then A_{i,k} = 0
        End
        A_{i,j} = α
        If |A_{i,j}| < t* Then A_{i,j} = 0
    End
End
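The pseudocode above can be transcribed into Python roughly as follows (NumPy assumed; the dense-matrix form and the names are ours, since a practical ILUT works on sparse row storage):

```python
import numpy as np

def ilut(A, t):
    """ILUT(t): in-place incomplete LU; off-diagonal entries smaller than
    t times the 2-norm of the active row are dropped."""
    F = A.astype(float).copy()
    n = F.shape[0]
    for j in range(n - 1):
        for i in range(j + 1, n):
            tstar = t * np.linalg.norm(F[i, :])   # row-based drop threshold
            alpha = F[i, j] / F[j, j]
            F[i, j + 1:] -= alpha * F[j, j + 1:]
            for k in range(i + 1, n):             # drop small entries; diagonal kept
                if abs(F[i, k]) < tstar:
                    F[i, k] = 0.0
            F[i, j] = alpha if abs(alpha) >= tstar else 0.0
    return F

# With t = 0 nothing is dropped, so ILUT reproduces the complete LU factors.
A = np.array([[4., 1., 0.],
              [1., 4., 1.],
              [0., 1., 4.]])
F = ilut(A, 0.0)
L = np.tril(F, -1) + np.eye(3)
U = np.triu(F)
```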
Unlike in the structure-based ILU(ℓ), a nonzero entry in the original matrix A does not
guarantee a nonzero entry at the same location in the factors. Note that any entry can
be dropped, except those on the main diagonal, which are kept intact so that the factors remain
nonsingular. A smaller threshold value keeps the factors closer to complete, and ILUT(0) is
identical to the exact LU factorization. On the other hand, a very large threshold
eliminates most off-diagonal entries. ILUT(1) produces an identity matrix L̃ and a diagonal
matrix Ũ, which, when used as a preconditioner, is identical to the Jacobi Preconditioner.
The structure-based ILU(ℓ) family and the threshold-based ILUT family are the two major
branches of incomplete LU factorizations. A large number of more sophisticated and robust
ILU algorithms have been developed for different applications [13, 16, 33], and most of them are
based on one of these two basic ideas.
2.4 Nodal Reordering Strategies for Finite Element Meshes

The linear systems in this study are constructed from finite element meshes. Before we
construct the linear system, it is possible for us to assign numbers to the mesh nodes in different
orders, to make the system “easier to process.” The same effect can be achieved by permuting
the rows and columns of the linear system, although permutations of a large system can be
incredibly expensive if they involve the swapping of physical entries.
Depending on how the finite element mesh is created, its nodes may be ordered in a way
that makes computing inefficient. What we refer to as “natural ordering” usually numbers the
nodes by the order in which they enter the system, which may be completely unrelated to the
geometrical structure or connectedness of the mesh. As we shall see, a less intuitive ordering
scheme is often desired for many different reasons.
A typical method for developing parallel algorithms for finite element methods is to
distribute groups of elements to different processors. Therefore, if a node is affiliated with
elements that lie on different processors, then operations with that node require passing data
between processors. The cost of inter-processor communication is often very significant.
Hence, there are two issues: The first is optimally partitioning the elements to minimize the
number of nodes that must be shared (and to achieve good load balancing). The second is the
numbering of nodes on the mesh so that the parallel preconditioner is optimal.
Some reordering strategies can conserve computer storage as well as reduce the actual
calculation time, since they influence the performance of some preconditioners. These
strategies and their potential benefits to finite element methods are of particular interest to us.
We introduce the classic Cuthill-McKee Algorithm here as a starter, and go into further analysis
in a later section.
2.4.1 Cuthill-McKee Algorithm
In order to solve a large system of equations efficiently, one must conserve computer
storage as well as calculation time. E. Cuthill and J. McKee devised a robust algorithm to
condition sparse symmetric matrices by reordering the nodes [7]. The bandwidth of a matrix is
the maximum value of |i - j| over all nonzero entries A_ij; in
other words, it is a measure of how far the nonzero elements lie from the main diagonal.
Matrices with small bandwidths have several advantages, as we shall see later. The
Cuthill-McKee algorithm is designed to reduce the bandwidth of a matrix. Figure 2 shows an
example of its bandwidth reduction ability.
The basic idea is that, for a sparse matrix A, we want to find a permutation matrix P
such that PAP^T, which permutes the rows and columns of A, "moves" the nonzero
elements as close to the main diagonal as possible, hence reducing the bandwidth. In practice,
however, permuting a large matrix would be extremely inefficient, so this algorithm aims to
reorder the nodes on the graph of matrix A prior to the matrix's construction, effectively
introducing a permuted index set.
Before moving on to the algorithm, we shall review some basic terminology in graph theory.
A graph consists of a finite set of nodes (or vertices) connected by a finite set of edges. In a
weighted graph, each edge is assigned a weight, which is a numerical value. The degree of a
node is the sum of weights of the edges connected to it, and in an unweighted graph it is simply
the number of edges connected to it. Two vertices are adjacent if there is an edge between them.
A path is a sequence of consecutive edges, and two nodes are connected if there is a path from
one to the other. A graph is connected if every node in it is connected to every other node. A
component is a connected subgraph. A circuit is a path which ends at the starting node. A tree
is a graph containing no circuit, and a spanning tree of a graph is a subgraph that is a tree and
contains all of the nodes. For more detailed discussion, see [8].
Given a graph G
Select a node of minimum degree, label it 1
When k nodes have been labeled, 1 ≤ k < n ,
Select the smallest i such that node i has unlabeled neighbors
Locate all of node i ’s unlabeled neighbors ( u1 ,..., um )
In increasing degree order, label these nodes k + 1,..., k + m
Repeat until all n nodes in G have been labeled.
In the event that G has more than one component, this algorithm stops after it labels all
m nodes of one component with a tree, m < n . Continue by labeling a node of minimum
degree on another component m + 1 , and repeat until all the nodes in the graph are labeled.
When dealing with finite element meshes that are entirely connected, this algorithm generates a
spanning tree across the entire mesh.
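The labeling procedure above, including the restart on additional components, can be sketched in Python (an illustrative sketch only; the experiments in this thesis use Matlab):

```python
from collections import deque

def cuthill_mckee(adj):
    """Classic Cuthill-McKee ordering.

    adj maps each node to the set of its neighbors.  Returns a list
    'order' where order[k] is the node that receives label k + 1.
    If the graph has several components, labeling restarts at an
    unlabeled node of minimum degree."""
    order, seen = [], set()
    # candidate start nodes, lowest degree first (ties broken by node id)
    for start in sorted(adj, key=lambda v: (len(adj[v]), v)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # label v's unlabeled neighbors in increasing degree order
            for u in sorted(adj[v] - seen, key=lambda u: (len(adj[u]), u)):
                seen.add(u)
                queue.append(u)
    return order

# Star graph: a minimum-degree leaf is labeled first.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
order = cuthill_mckee(star)
```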
In essence, given a labeled node i in the graph, this algorithm labels each of its neighbors
with a number j as close to i as possible. The edge connecting these two nodes becomes Aij
in A , and such a nodal ordering keeps |i − j| small. It is easy to see that an upper bound for the
bandwidth of A is 2m − 1 , where m is the maximum number of nodes per level of the
spanning tree generated by the algorithm.
The Cuthill-McKee Algorithm reduces, but does not necessarily minimize bandwidth. For
a specific family of matrices that share certain special properties, it is possible to devise an
algorithm to reduce bandwidth beyond Cuthill-McKee’s ability. Moreover, even for the same
graph, Cuthill-McKee can yield several matrices of different bandwidths, depending on the
starting node and the ordering of equal-degree nodes. See Figure 3 for an example.
Nonetheless, Cuthill-McKee ordering generally provides a significant bandwidth improvement
over natural ordering, applies to a wide range of problems, and can be easily automated. Due
to these benefits, the algorithm is widely used in scientific computing.
While the Cuthill-McKee Algorithm is well-known for its ability to reduce the bandwidth,
many preconditioners implement a more popular variation called the Reverse Cuthill-McKee
(RCM) Algorithm [11]. As its name suggests, this variation uses the same ordering pattern but
assigns numbers backwards.
Given a graph G
Select a node of minimum degree, label it n
When k nodes have been labeled, 1 ≤ k < n ,
Select the largest i such that node i has unlabeled neighbors
Locate all of node i ’s unlabeled neighbors ( u1 ,..., um )
In increasing degree order, label these nodes n − k ,..., n − k − m + 1
Repeat until all n nodes in G have been labeled.
Although the algorithms seem very similar, the original Cuthill-McKee and the Reverse
Cuthill-McKee behave rather differently. We shall explore their differences in a later section.
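In practice, rather than running the reversed rules directly, RCM is usually obtained by computing the classic Cuthill-McKee ordering and reversing it, since the node labeled k under CM receives label n + 1 − k under RCM. A trivial sketch (the CM ordering shown is hypothetical example data):

```python
def reverse_order(cm_order):
    """Reverse a Cuthill-McKee ordering to obtain RCM:
    the node labeled k becomes node n + 1 - k."""
    return cm_order[::-1]

cm_order = [3, 1, 4, 0, 2]    # example CM ordering (node ids, labels 1..5)
rcm_order = reverse_order(cm_order)
```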
Chapter 3 Problem Description
In the 2002 SIAM Review article “Effects of Ordering Strategies and Programming Paradigms
on Sparse Matrix Computations,” Oliker, Li, Husbands, and Biswas examine how nodal
ordering strategies affect the performance of sparse matrix computations.
This sparked our interest in the relationship between ILU preconditioners and nodal
reordering strategies. Which ILU works best? Which ordering strategy can improve
preconditioning quality? Do different ILU preconditioners perform well with different
reordering schemes? We want to search for efficient methods for solving finite element linear
systems by studying the combined behavior of nodal reordering schemes and ILU
preconditioners.
First, we propose a numerical study for this problem using Matlab. The strengths and
weaknesses of structure-based ILU(p) and threshold-based ILUT are compared using a series of
examples. Next, the classic Cuthill-McKee and Reverse Cuthill-McKee algorithms are analyzed.
Then, a detailed numerical study is conducted involving multiple meshes, reordering strategies,
and preconditioners. We seek a trend in reordering-preconditioner pairs that most
efficiently simplify the solution of our linear systems.
Chapter 4 Numerical Experiments
The following numerical experiments are carried out with Matlab 7.0 (R14) on a Windows
XP machine with Intel® Pentium® 4 2.02 GHz processor and 1GB of RAM.
The meshes we use are listed below. Many of them share the same domain at several
different mesh densities. Figure 4 illustrates the coarsest and the most refined mesh for each
example. The number of mesh points indicates the problem size; the larger the problem, the
better the algorithms needed to speed up calculations.
4.2 ILU(0) and ILUT
Matlab implements incomplete LU factorization in two ways: the threshold-based ILUT
method and ILU(0), the special case of the structure-based ILU(p) methods [21]. To use the
ILUT method, the user would call luinc( A , t ), with t being the tolerance, or threshold.
Factors of matrix A are computed, and entries smaller than t times the 2-norm of the column
vectors of A are dropped. Specifying zero tolerance produces factors with no dropped entries;
that is, luinc( A , 0) gives the same result as lu( A ). Another special case with tolerance being 1
drops all entries except those on the main diagonal. On the other hand, calling luinc( A , ‘0’)
results in the structure-based ILU(0). The L and U factors from luinc( A , ‘0’) have nonzero
entries only at locations where A has nonzero entries.
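The drop rule used by luinc( A , t ) can be illustrated on a single column. This sketch models only the dropping criterion, not the factorization itself (and in the real algorithm the diagonal entry is always kept, which is not modeled here):

```python
def drop_entries(col, t):
    """Drop entries smaller in magnitude than t times the column's 2-norm,
    mirroring the luinc(A, t) drop rule on a single column of values."""
    norm = sum(x * x for x in col) ** 0.5
    return [x if abs(x) >= t * norm else 0.0 for x in col]

# With t = 0.1 the small entry 0.1 falls below 0.1 * ||col|| and is dropped.
col = [3.0, 4.0, 0.1]
dropped = drop_entries(col, 0.1)
```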
However, the other cases of structure-based ILU(p), where p ≥ 1 , are unavailable. Not
only does Matlab lack an implementation of them, many researchers also neglect to mention any
ILU(p) beyond ILU(0). In some of the literature, it appears as if ILU(0) and ILUT are the only
LU-based preconditioners of importance. The reason is not obvious, but for now we shall
use the tools readily available to learn more about the two classes of ILU.
4.2.1 ILU(0)
Before we make comparisons, let us review the behavior of ILU(0). Its most obvious
property is the predictable sparsity pattern: L or U has a nonzero entry if and only if A has
a nonzero entry at the same location. On the other hand, while ILUT can produce much denser
or sparser factors, the nonzero elements in the factors do not necessarily overlap those in A .
Figure 5 shows the U factor of ILU(0) and ILUT at three thresholds, superimposed over the
same matrix. Only ILU(0) preserves the original matrix’s sparsity pattern.
This feature can be very important, especially if memory is limited. Given any square
matrix, we know a priori exactly how much memory is required to store its ILU(0) factors.
Moreover, because the method allows no fill-in, the dependency among the rows stays constant.
This simplifies parallel implementations, as discussed in the next section.
The danger of ILU(0), or of any structure-based ILU(p) method, is its insensitivity to the
magnitude of fill-ins. Potential LU elements are dropped based only on their locations, not
their values. Therefore, many small, insignificant elements may be preserved while large ones
are dropped. According to Karypis and Kumar, this can cause preconditioners to be ineffective
for matrices arising in many realistic applications [16].
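To make the structure-based behavior concrete, the following is a compact dense-matrix sketch of ILU(0) in the standard ikj formulation, suppressing fill-in by updating only positions where A is nonzero (an illustrative sketch, not the Matlab luinc implementation):

```python
def ilu0(A):
    """ILU(0) of a dense matrix stored as nested lists.

    Returns one matrix F holding L (unit lower triangular, below the
    diagonal) and U (upper triangular, including the diagonal).
    Fill-in is suppressed by updating only positions in A's pattern."""
    n = len(A)
    F = [row[:] for row in A]
    nz = {(i, j) for i in range(n) for j in range(n) if A[i][j] != 0.0}
    for k in range(n):
        for i in range(k + 1, n):
            if (i, k) in nz:
                F[i][k] /= F[k][k]           # multiplier L[i][k]
                for j in range(k + 1, n):
                    if (i, j) in nz:         # skip: would be fill-in
                        F[i][j] -= F[i][k] * F[k][j]
    return F
```

When A's pattern admits no fill-in, ILU(0) coincides with the complete LU factorization, which makes small cases easy to check by hand.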
4.2.2 ILUT
The sparsity pattern of the ILUT factors is unpredictable. We study the effect of the
threshold parameter through numerical experiments. Four matrices are chosen and
preconditioned with ILUT, using a wide range of threshold values between 0 and 1. The
condition number of the preconditioned system K(U^-1 L^-1 A) and the sparsity of the factors
( L + U − I ) are observed. Because these two values trade off against each
other, we compare them on the same graph. The values are plotted in Figure 6, where the
log of the condition number is used to enhance the contrast among the smaller values.
Common to all four graphs, the precondition number and the number of
nonzero elements are roughly inversely related. However, less sparse factors do not
always imply a higher preconditioning quality. The local minima on the precondition curves
indicate the best precondition numbers among all ILUT factors of similar sizes. To solve a
large linear system most efficiently, it would be in the user’s interest to take advantage of such
local minima. Unfortunately, in practice they cannot be found without excessive
computation.
When the precondition number approaches 1, the ILUT factor sizes approach those of a
complete LU factorization and become extremely memory-inefficient. On the other hand, as the
memory requirement is minimized ( t → 1 ), the precondition number becomes enormous
(sometimes even larger than K(A) ). As discussed in Chapter 2, the ILUT preconditioner with
t = 1 is identical to the Jacobi (diagonal) preconditioner. This experiment shows that such
preconditioning offers little help when the matrix is not close to diagonal. As a result of this
study, we see that ILUT is ineffective when the threshold is near either extreme.
4.2.3 Comparisons
In this section, we investigate how the structure-based factorizations differ from the
threshold-based versions in practice. We want to find out how their preconditioning qualities
compare. The answer is revealed by simply adding the statistics from ILU(0) in Figure 6.
On all four graphs, the blue circle indicates the log of the ILU(0) precondition number,
and the red circle indicates the number of nonzero elements in the ILU(0) factors. The blue
circle lying to the left of the red circle suggests that, to achieve the same precondition
number, ILUT generates sparser factors. In other words, when the threshold is chosen so
that the ILUT factors are as sparse as ILU(0)’s, the ILUT precondition number is much lower.
In all four cases in this experiment, our findings show that ILUT is a more efficient
preconditioner than ILU(0): less memory and a smaller precondition number.
Although our findings here do not favor ILU(0), its performance appears to
approach that of ILUT as the matrix size increases. When the matrix size is in the hundreds,
and assuming similar factor sizes, ILUT produces a precondition number that is roughly 1/8 of
the ILU(0) precondition number. When the matrix size goes over ten thousand, that ratio
increases to roughly 1/4. One may wonder whether, as the matrix size grows, this ratio
approaches or even exceeds 1, in which case ILU(0) would become the better preconditioner.
Unfortunately, since calculating the exact precondition number K(U^-1 L^-1 A) is impractically
resource-consuming, we are unable to carry out the same experiment on much larger matrices.
Now, if we suppose that ILU(0)’s weak performance stems from its lack of attention to element
magnitudes (as mentioned above), then we have a hypothesis for why structure-based ILU(p)
with p > 0 is not used in practice. While still possessing the same structure-based magnitude
insensitivity that makes it a weak preconditioner, it loses the clean and obvious sparsity pattern
that ILU(0) has. Therefore, ILU(p) with p > 0 has little value beyond theoretical discussions.
Suppose we have matrix F , generated from a finite element mesh with the Cuthill-McKee
reordering, and matrix R , generated from the same mesh with the Reverse Cuthill-McKee
reordering. Casually examining the sparsity patterns of the two matrices, we find that they
are mirror images of each other about the antidiagonal. Moreover, they have the same
bandwidth. Their only difference seems to be the alignment of the nonzero entries, as
illustrated in Figure 7.
In the Cuthill-McKee Algorithm, from a starting node i , its unassigned neighbors are given
numbers j , j+1 , j+2 , ... with j > i . Hence, in matrix F , many nonzero entries line up
in the pattern Fi,j , Fi,j+1 , Fi,j+2 , ... with j > i . In Reverse Cuthill-McKee, on the other
hand, unassigned neighbors of node j are given numbers i , i−1 , i−2 , ... with i < j .
Hence, in matrix R , many nonzero entries line up in the pattern Ri,j , Ri−1,j , Ri−2,j , ...
with i < j . In other words, looking at the upper triangular half of the matrices,
the nonzero entries in F line up horizontally, whereas the nonzero entries in R line up vertically
(reversed in the lower triangle).
Recall that during elimination, fill-in can appear only beneath or to the right of existing
nonzero entries: at upper-triangular positions with no nonzero entries above, or
lower-triangular positions with no nonzero entries to the left, no fill-in
would ever occur. In other words, fill-ins always happen within the “umbrella region.”
Due to the nonzero alignment patterns, it is easy to see in Figure 8 that matrix F has a
much larger “umbrella region” than matrix R . Hence, more potential fill-ins would be
eliminated during ILU(0), or could be, during ILUT. Intuitively, the more fill-ins we eliminate,
the more our incomplete factors differ from the true factors. With less accurate factors, the
ILU preconditioner produces a less ideal (higher condition number) preconditioned matrix.
Next, we would like to see the effects of nodal reordering on the ILU preconditioners.
Renumbering the nodes in a finite element mesh is equivalent to permuting the equations and
unknowns of the corresponding linear system. Although the system as a whole and its solutions
stay intact, the physical structure changes and preconditioning qualities could be affected.
We take the four matrices from the previous experiment (Section 4.2.2, Figure 6) and
reorder/permute them using the Reverse Cuthill-McKee algorithm. The ILUT preconditioner is
applied with the same threshold values as in the previous experiment, and the precondition
numbers and sparsity numbers are recorded accordingly. These numbers of the original (natural
ordering) matrices are subtracted from those of the new (RCM ordering) matrices. The
differences are plotted in Figure 9.
When the threshold is large, RCM ordering does not seem to benefit ILUT preconditioning.
The precondition numbers differ dramatically without a fixed direction or pattern. At the same
time, the sparsities of the factors are barely affected, if at all. Therefore, when we use ILUT
with large thresholds, the RCM reordering algorithm has an unpredictable effect on the
preconditioning quality while saving no memory. In this case, the reordering is an unnecessary
waste of effort.
On the other hand, RCM ordering does seem to improve ILUT with small thresholds. The
difference in precondition numbers converges to zero as the threshold decreases, while the
difference in sparsity grows concurrently: RCM-ordered matrices are preconditioned to
the same quality with smaller ILUT factors. However, based on our experiment, the amount of
sparsity gained is not proportional to the size of the matrix or the size of the threshold.
Comparing graphs (b) and (c) in Figure 9, matrix (c) before preconditioning is almost double the
size of matrix (b), but the ILUT memory it saves with RCM ordering is less than half when the
threshold is small. In addition, graph (d) suggests that for this particular matrix, the memory
efficiency gained from RCM is maximized when the threshold is slightly less than 10^-3, and drops
back to zero as the threshold continues to decrease. In essence, RCM ordering could improve the
preconditioning quality of ILUT, although the result is not guaranteed. We say “could”
because the red curve does not really start to drop until the threshold falls below 10^-2, at
which point the ILUT factors are larger than the ILU(0) factors. The sparsity difference
becomes significant only when the ILUT factors have become quite dense. Since the purpose
of incomplete LU is to keep the factors sparse, it would probably not be sensible to employ
thresholds small enough to see the difference between these two ordering strategies.
Recall that our real goal in preconditioning is to speed up the convergence of iterative
solvers, and the precondition number is only a rough indication of this. The actual effectiveness
of preconditioners and reordering strategies needs to be tested by actually solving the system with
an iterative solver. Tables 1.1~1.22 list the condition number, sparsity, and iterations to
convergence for our matrices (meshes described in Section 4.1), each with natural ordering,
Cuthill-McKee ordering, and Reverse Cuthill-McKee ordering. The preconditioners applied are
ILU(0), Jacobi, and ILUT at 18 other threshold levels. The iterative solver used is Matlab’s
implementation of GMRES.
For each preconditioner in each table, the best (i.e., smallest) values are highlighted. At a
quick glance, we find the counterintuitive fact that sometimes a smaller condition number is
associated with slower convergence! For example, in Table 1.3 (two_hole_0) with
ILUT(5e-2), RCM produces the sparsest matrix with the smallest condition number, but the
solver took two more iterations to converge than with the natural or CM orderings. In
general, however, small condition numbers still lead to faster convergence. This shows that
the precondition number is a good predictor of algorithm performance, but the best
predictor is to actually test it in the iterative solver.
For the ILU(0) preconditioner, RCM reordering consistently produces the best results: the
lowest condition number and, more importantly, the fastest convergence. RCM-ordered matrices
converge in up to 27% fewer steps than naturally ordered matrices. On the other hand, CM tends
to be a very poor ordering for ILU(0), giving significantly worse numbers than natural ordering.
From our structural analysis, recall that the classic CM ordering generates a much larger
“umbrella region” than the original ordering, in which ILU(0) eliminates many more potential
fill-ins and pushes its factors further from the complete LU factors. RCM, with a smaller
“umbrella region,” has an ILU(0) factorization much closer to complete LU.
In contrast, the classic CM ordering seems to suit ILUT fairly well. Especially when the
matrix becomes large and the threshold is relatively small, it yields a better convergence rate than
the natural or RCM orderings. It is interesting to observe that such CM-ordered matrices often
have the largest precondition numbers, yet they still converge the fastest.
In a nutshell, we learn that the usefulness of each ordering strategy depends entirely on how
it is used. Although nothing is absolute, we do observe a general pattern of best-matching
ordering strategies, preconditioners, and problem sizes. Reverse Cuthill-McKee is consistently
the best ordering scheme for ILU(0), classic Cuthill-McKee ordering works well for ILUT with
moderate to small thresholds on large systems, and natural ordering should suffice by itself on
small problems.
The Cuthill-McKee algorithm and its reversed version are essentially two special cases of
breadth-first search (BFS), with some special requirements. Since the algorithm was originally
developed with bandwidth reduction in mind, not preconditioning quality, one may wonder if
there exist other BFS-based ordering schemes that better suit our interest.
Given a graph G , suppose the scheme assigns numbers low to high [high to low]
Select a node and label it 1 [ n ]
When k nodes have been labeled, 1 ≤ k < n ,
Select the smallest [largest] i such that node i has unlabeled neighbors
Locate all of node i ’s unlabeled neighbors ( u1 ,..., um )
In specified sorting order, label these nodes k + 1,..., k + m [ n − k ,..., n − k − m + 1 ]
Repeat until all n nodes in G have been labeled.
We devise a test consisting of 14 BFS-based ordering schemes, each with some unique
requirements. There are seven ways to traverse the mesh by sorting a node’s neighboring nodes
differently. With each traversal, one scheme assigns numbers from low to high while another
goes from high to low. CM and RCM are included among these. Below is a table listing the
schemes used in our test.
Scheme    Neighbor sorting criterion                                              Numbering direction
Test07    Physical distance of the neighboring nodes from node i, descending      Low to high
Test08    Physical distance of the neighboring nodes from node i, descending      High to low
Test09    Physical distance of the neighboring nodes from node i, ascending       Low to high
Test10    Physical distance of the neighboring nodes from node i, ascending       High to low
Test11*   For neighboring node j, the value of Aij, descending                    Low to high
Test12*   For neighboring node j, the value of Aij, descending                    High to low
Test13*   For neighboring node j, the value of Aij, ascending                     Low to high
Test14*   For neighboring node j, the value of Aij, ascending                     High to low
On this list, Test01 is the most generic version of breadth-first search, so the new ordering
depends heavily on the existing natural ordering. Test03 and Test06 are the classic CM and the
popular RCM, listed here for comparison purposes. The rest are simple modifications of
existing ideas, and they represent only a small fraction of the possible breadth-first search
variants.
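All fourteen schemes fit one parameterized traversal: a BFS with a pluggable neighbor-sorting key and a numbering direction. A Python sketch (the graph shown is a hypothetical example, and a connected mesh is assumed):

```python
from collections import deque

def bfs_ordering(adj, start, key, reverse_labels=False):
    """Generic BFS-based ordering scheme for a connected graph.

    adj maps node -> set of neighbors.  'key' sorts the unlabeled
    neighbors of each visited node (e.g. degree for CM-like schemes,
    physical distance or matrix-entry magnitude for the Test* schemes).
    With reverse_labels=True, numbers are assigned high to low."""
    order, seen = [], {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in sorted(adj[v] - seen, key=key):
            seen.add(u)
            queue.append(u)
    return order[::-1] if reverse_labels else order

# A degree-based key gives a CM-like traversal (Test03);
# reverse_labels=True gives the RCM-like variant (Test06).
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
cm_like = bfs_ordering(adj, 0, key=lambda u: len(adj[u]))
```

Other schemes in the table drop in by changing only the key, e.g. `key=lambda u: -abs(A[v][u])` for a magnitude-based variant.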
Test11 through Test14 are not finite element mesh orderings in the same sense as the others.
Because they order nodes based on the magnitudes of matrix A ’s entries, the orderings are not
available before the matrix is built from the mesh. They can be achieved by matrix
permutations and have a positive effect on ILU preconditioning. However, on computers that
permute by moving physical entries (such as distributed-data machines), the cost of such an
operation could become prohibitive. Setting that cost aside, rearranging a matrix according to
the actual magnitudes of its elements can be practically sensible. If deemed useful, these
schemes can still be applied to time-dependent problems, where the cost of building and
permuting the matrix is repaid by higher efficiency at each time step of solving the problem.
We pick three meshes from each of our four domains, apply the test ordering schemes to
them, precondition using ILU(0) and nine thresholds of ILUT, then record their GMRES
convergence iterations. These numbers are recorded in Tables 2.1 ~ 2.4.
The purpose of this experiment is to find better ordering schemes than what we already
have from the previous experiment. Therefore, we use bold borders for columns A, 03, and 06,
which are natural, CM, and RCM orderings. For each row, the smallest iteration number
among the three orderings is determined. Then, any scheme that produces the same or better
iteration is highlighted green or yellow, respectively. In other words, a green entry means that
the ordering is as good as the best among natural, CM, and RCM; a yellow entry means that it is
better than all of the three. When a column has many green and yellow entries, that particular
ordering is probably what we are looking for.
First, we examine the ILU(0) case. RCM is still the best among the orderings, with only one
exception: Test12 consistently produces a matrix of the same quality as RCM, and sometimes even
better. This permutation scheme shares with RCM the property of producing a small
“umbrella region,” because it assigns numbers backwards (high to low). Moreover, because it
arranges the largest elements of A close to the main diagonal, it minimizes the overall
magnitudes of the eliminated fill-ins. While ILU(0) performs its structure-based incomplete
factorization, the Test12 permutation gives it some of the threshold-based advantages. Therefore,
it is stronger than the popular RCM ordering.
Next, we look at the ILUT cases. It seems difficult to compare all of the columns at
first glance, although it is apparent that some of our new ordering schemes are highly
comparable to, or even better than, natural, CM, and RCM. When the threshold is
small, the ordering schemes that assign numbers forward perform better than their backward
equivalents. However, the distinction is less obvious when the threshold is large. After a more
detailed examination, we find that Test05, Test10, Test11, and Test13 have the best overall
performance. Setting aside Test11 and Test13, which are not true mesh reordering schemes,
there are still two very satisfying results. Test05 is merely a modified Cuthill-McKee algorithm,
with nodes sorted in descending degree order rather than ascending. Test10, on the other hand,
requires a slightly more sophisticated implementation and is more computationally expensive
due to the calculation of the physical distances between nodes.
Our experiment finds some BFS-based nodal reordering schemes that generate better
numerical results than the existing Cuthill-McKee and Reverse Cuthill-McKee algorithms. Our
schemes not only help the ILU preconditioners and the GMRES iterative solver toward faster
convergence, they are also very easy to understand and implement. We also recognize that no
single scheme is perfect, so the choice should be made around the specific problem
that we want to solve and the preconditioner that we wish to use.
Chapter 5 The Parallel Case
This parallel ILU factorization algorithm is inseparable from its nodal reordering strategy.
The ordering is much more sophisticated than those in the single-processor cases, and the ILU
techniques account for only a small fraction of the algorithm. Although ILU(0) and ILUT are
both parallelizable based on this theory, the ILUT case is slightly more complicated.
Next, assume that we want to solve this problem on a concurrent computer with four
processors. Then it is not so straightforward. First of all, we need to partition the mesh into
four pieces. This partitioning process is a nontrivial field of study in itself. There are direct
k-way methods, recursive methods based on geometry or graph theory, and multi-level methods
[3, 9, 15, 23]. Efficient and robust partitioning algorithms often coarsen the mesh first, partition
using one of the simple methods, and then uncoarsen back to the original mesh with optimization
and local refinement at each step. Typically, these algorithms have two common goals:
1) partition the mesh into roughly equal sizes, so the processors have balanced
workloads, and 2) minimize the number of edges connecting two partitions, which minimizes the
amount of communication across the processors. We will not go into the details of these
algorithms, but shall assume we have an ideal partition of the mesh.
Figure 10 (b) shows an ideal partitioning of this mesh, with each partition holding exactly
12 nodes. Each of the four colors represents a partition, which is held on one processor. Still
under natural ordering, we color each row of matrix A with the color that corresponds to the
processor where it resides. When the colors interleave, we know that this ordering leads to an
inefficient ILU preconditioner. Consider row 22 (the first yellow row). To process this row,
A22,27 needs information from A19,27 , which lies on the red processor. And to process row 19,
A19,23 requires information from A15,23 , which lies on the blue processor. To factor A under
such a configuration, a large number of rows would have to be passed back and forth among the
processors at each step, making the factorization unbearably expensive.
One intuitive way to improve the situation is to use a simple nodal reordering to group
each processor’s rows together. For example, we can assign a new number to every node in one
processor before moving on to another, so that matrix A looks like Figure 10 (c). Now the
inter-processor communication is somewhat reduced, and factoring the rows in the first partition
requires no information from the other processors. However, while this ordering scheme can
benefit sequential algorithms, the rows’ dependence on each other still prevents ILU from
running concurrently on all processors.
First, classify the nodes in each processor as interior or interface nodes. Interior nodes are
those whose adjacent nodes all reside on the same processor, while interface nodes have adjacent
nodes on two or more processors. Figure 11 (a) highlights all the interface nodes. Assign new
numbers to the interior nodes, one processor at a time. When this is finished, there is an m < n
such that nodes k ≤ m are all interior nodes, and nodes k ≥ m + 1 are all interface nodes. In
our case, as illustrated in Figure 11 (b), m = 26 .
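The interior/interface classification follows directly from the adjacency structure and a partition map. A small sketch (the graph and partition are hypothetical example data):

```python
def classify_nodes(adj, part):
    """Split nodes into interior and interface sets.

    adj maps node -> set of neighbors; part maps node -> processor id.
    A node is interior when all of its neighbors live on the same
    processor as the node itself; otherwise it is an interface node."""
    interior, interface = set(), set()
    for v, nbrs in adj.items():
        if all(part[u] == part[v] for u in nbrs):
            interior.add(v)
        else:
            interface.add(v)
    return interior, interface

# Path 0-1-2-3 split across two processors: 1 and 2 sit on the cut.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
part = {0: 0, 1: 0, 2: 1, 3: 1}
interior, interface = classify_nodes(adj, part)
```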
The next step is to compute maximal independent sets from the remaining interface nodes,
denoted by AI . An independent set I of a graph G is a subgraph in which no two nodes
are adjacent. Maximal independent sets can be found using Luby’s algorithm [19] (Algorithm 9).
Once we have a maximal independent set I of AI , we assign new numbers, in order, to the
nodes in I . Afterwards, we set AI = AI \ I and I = ∅ , and the process repeats until AI is empty.
Figure 11 (c) ~ (f) illustrate the successive independent sets, along with their new nodal
numbering. Note that at each step, the nodes in I are numbered with regard to the order of the
processors.
Figure 11 (g) shows the original mesh with the new nodal numbers, and Figure 11 (h) is the
corresponding matrix A . At first glance, this matrix seems very poorly ordered. The
nonzero elements are spread out without a clear pattern, and the bandwidth is huge. On a single
processor, as we have already seen, such an arrangement yields poor conditioning. Also,
each processor holds largely disjoint rows beyond row m , which does not seem to improve on
the original ordering. However, the independence among these new rows makes this
seemingly messy matrix highly parallelizable.
To compute a maximal independent set I of a given graph G
G′ = G
While |G′| > 0
  For each node Gi in G′
    Label(Gi) = Random()
  End
  For each node Gi in G′
    If Label(Gi) < Label(Gj) for all Gj adjacent to Gi in G′
      I = I ∪ { Gi }
  End
  G′ = G′ \ I
  G′ = G′ \ { Gi ∈ G′ | ∃ Ik ∈ I such that adj(Gi, Ik) }
End
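Luby's randomized procedure translates almost line for line into Python (an illustrative sketch; the thesis implementation is in Matlab):

```python
import random

def luby_mis(adj, seed=0):
    """Maximal independent set via Luby's randomized algorithm.

    adj maps node -> set of neighbors.  Each round, every remaining
    node draws a random label; a node whose label is smaller than all
    of its remaining neighbors' labels joins I, after which it and its
    neighbors are removed from further consideration."""
    rng = random.Random(seed)
    remaining = set(adj)
    I = set()
    while remaining:
        label = {v: rng.random() for v in remaining}
        winners = {v for v in remaining
                   if all(label[v] < label[u]
                          for u in adj[v] & remaining)}
        I |= winners
        for v in winners:
            remaining.discard(v)
            remaining -= adj[v]   # neighbors of a winner can never join I
    return I
```

Each round removes at least the minimum-label node, so the loop terminates, and no two nodes of I are ever adjacent.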
The first and entirely parallelizable part of the ILU factorization of A is its interior rows.
As highlighted in Figure 12 (a) with vertical bars, different processors do not have nonzero
elements in the same column. In other words, the interior rows in one processor are completely
independent from interior rows in other processors. Therefore, all processors can factor their
own interior rows simultaneously. No waiting or inter-processor communication is required.
In fact, since each set of these rows can be viewed as an independent matrix, we can locally
enhance it with reordering strategies, such as those discussed in the previous section.
Then, as illustrated in Figure 12 (b) ~ (d), one independent set of interface rows is factored
at a time. While factoring these rows requires information from the above rows, some on
foreign processors, it can still be run in parallel. Because the interface rows within one
independent set do not depend on any row that is within the same independent set and on another
processor, there is no need for the processors to wait for each other. In the best case, all
processors would have an equal number of rows within each independent set, so they can
finish together, and no processor needs to wait for another before moving on to the next set.
Although this nodal ordering strategy can be used in parallel for structure-based ILU(p),
threshold-based ILUT, and even complete LU algorithms, slight differences apply to
each implementation. ILU(0) is the simplest, because it permits no fill-in whatsoever and
independent rows remain independent. ILUT is more complicated because its fill-ins introduce
new dependencies during factorization. When computing the independent sets, possible fill-ins
have to be taken into consideration to keep the sets truly independent. As a consequence, each
independent set could be smaller than that of an ordering for ILU(0).
Since the interior rows can be factored completely in parallel, it is in our best interest to
have as many of them as possible. When the mesh/matrix is relatively large compared to the
number of partitions/processors, most of the nodes/rows will be interior. The more the interior
nodes outnumber the interface nodes, the closer the factorization is to truly parallel. Increasing
the number of processors increases the speed of the factorization, but at the same time
reduces the parallelizability by increasing the percentage of interface rows. In any event, a
high-quality mesh partitioning algorithm is critical for the effectiveness of the parallel ILU.
Our partitioning method for square domains is completely location-based, and our goal is to
assign an equal number of nodes/rows to each processor. First, we draw a horizontal line across
the middle of the mesh. Then, we compare the numbers of nodes falling on either side of the
line, and move the line toward the smaller side. Eventually, the line bisects the mesh.
Repeatedly applying this process to the two halves of the mesh easily gives us 2^n equally
sized partitions. The quality of our partitioning is fairly high, and the process is easily
automated. The downside is that this process can be very costly. However, for large problems
arising from time-dependent nonlinear PDEs, this preprocessing cost is negligible compared to
the remaining calculations.
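As a rough illustration, the repeated line-moving process can be condensed into a recursive median split. This is a Python sketch under the assumption that nodes are given as (x, y) coordinate pairs; it is not the thesis's actual code.

```python
def bisect(nodes, axis):
    # sort along one coordinate and cut at the median: the position the
    # iteratively moved line converges to
    ordered = sorted(nodes, key=lambda p: p[axis])
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]

def partition(nodes, levels, axis=1):
    """Recursively bisect `nodes` into 2**levels equally sized pieces,
    alternating horizontal (axis=1) and vertical (axis=0) cuts."""
    if levels == 0:
        return [nodes]
    left, right = bisect(nodes, axis)
    return (partition(left, levels - 1, 1 - axis) +
            partition(right, levels - 1, 1 - axis))
```

With levels = 4, this yields the sixteen partitions used in the experiments below.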
We choose to partition three levels of refinement of the same domain, each into 2, 4, and 16
pieces, as graphed in Figure 13. The largest mesh (two_hole_4) has over 25 times more nodes
than the smallest mesh (two_hole_0). Because the nodes lie fairly evenly across the domain,
each of our partitions takes up nearly the same physical area. The two holes in the domain
cause the nodal bisection lines to shift; otherwise we would see perfect grids. On each mesh,
different colors indicate the different processors assigned to handle the nodes. Interior nodes
are labeled with circles and interface nodes with asterisks, though the symbols are not visible
on the largest mesh, where all nodes are crammed together.
We list in Tables 3.1 through 3.3 the number of total nodes and the number of interface nodes
in each of the nine graphs. Note that our algorithm distributes the nodes among the processors
as evenly as possible, although a perfectly even split is not always feasible. The percentage
of interface nodes is listed on the side. The magnitude of this number is the most critical
indication of parallel efficiency.
For each of the three meshes, the percentage of interface nodes increases as the number of
partitions increases. In particular, when the smallest mesh is distributed to 16 processors, well
over half of its total nodes are interface nodes! Despite the amount of concurrent computing
power available, the majority of our parallel ILU efforts would be wasted transferring data back
and forth among processors. It is very possible that 4 processors can factor this system faster
than 16 processors together.
Also, when there are more partitions, the number of interface nodes per partition varies
more. When the smallest mesh is partitioned into 4, the number of interface nodes per partition
ranges between 27 and 32. On the other hand, when the same mesh is partitioned into 16, the
range widens to between 6 and 23. Such a large difference among partitions is very
undesirable, because it means a great workload disparity among the processors at every step of
the parallel ILU algorithm.
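The per-partition counts reported in Tables 3.1 through 3.3 can be reproduced with a simple scan over the mesh graph. The sketch below (in Python, with illustrative names rather than the thesis's code) marks a node as interface whenever it has a neighbor owned by another processor.

```python
from collections import defaultdict

def interface_stats(owner, adj):
    """Return, per processor: (total nodes, interface nodes, % interface)."""
    totals, iface = defaultdict(int), defaultdict(int)
    for node, proc in owner.items():
        totals[proc] += 1
        # interface node: at least one neighbor lives on another processor
        if any(owner[n] != proc for n in adj.get(node, ())):
            iface[proc] += 1
    return {p: (totals[p], iface[p], 100.0 * iface[p] / totals[p])
            for p in totals}
```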
Another thing to observe is the inverse relationship between the mesh size and the percentage
of interface nodes. In all of the 2-, 4-, and 16-processor cases, the percentage drops by about
one-third from Table 3.1 to Table 3.2, and by another two-thirds from Table 3.2 to Table 3.3.
When the problem size is large, partitioning the mesh yields relatively few interface nodes and
makes it sensible to employ a large number of processors. In Table 3.3 with 16 processors, we
see that 83.85% of all nodes are interior and can be ILU-factored simultaneously without
inter-processor communication. This speeds up the factorization significantly and justifies the
use of many processors.
The main lesson from this experiment is that the efficiency of parallel ILU does not
necessarily increase with the number of parallel processors. We must consider the nature of the
required nodal reordering scheme, and account for the added dependencies among processors as the
mesh is split into smaller partitions. When we choose an appropriate level of parallelism, the
algorithm can run efficiently, nearly in true parallel.
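One way to make this trade-off concrete, though it is a simplification and not taken from the thesis, is an Amdahl-style estimate: treat the interior fraction f as perfectly parallel across p processors and the interface work as essentially serialized by inter-processor waits.

```python
def est_speedup(f, p):
    # f: fraction of rows that are interior (fully parallel)
    # p: number of processors
    return 1.0 / ((1.0 - f) + f / p)

# two_hole_4 on 16 processors (83.85% interior): about 4.7x
# two_hole_0 on 16 processors (34.70% interior): about 1.5x
```

Under this crude model, the largest mesh benefits substantially from 16 processors while the smallest barely gains, matching the observations above.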
Chapter 6 Conclusions
The aim of this thesis is to study nodal reordering strategies for finite element meshes that
speed up the convergence rate of an iterative solver. We start out by examining some classic
linear system solvers, iterative linear solvers, preconditioners, and reordering strategies.
Then, we proceed to experiment numerically with some of the schemes and methods, and to analyze
the strategies on a single processor and on multiple processors.
In the single-processor case, we first compare the classic Cuthill-McKee ordering and the
popular Reverse Cuthill-McKee. While they reduce the matrix’s bandwidth equally, they behave
very differently under preconditioners: RCM only works with structure-based ILU(0), and CM works
best with ILUT with small thresholds. Then, we examine a list of similar ordering strategies
based on the concept of Breadth-First Search, and find that some of them improve preconditioning
quality even more than the well-known CM and RCM.
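All of these orderings share one BFS skeleton. A minimal Python sketch of plain Cuthill-McKee (illustrative, not the thesis code; `adj` is an assumed adjacency-list representation of the mesh graph):

```python
from collections import deque

def cuthill_mckee(adj, start):
    """BFS from `start`, visiting each node's unvisited neighbors in
    order of increasing degree; reversing the result gives RCM."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for n in sorted(adj[v], key=lambda u: len(adj[u])):
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return order  # RCM: order[::-1]
```

Variants of this scheme typically differ in the choice of starting node and in how ties among neighbors are broken.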
In the multi-processor case, we learn that the parallel ILU algorithm is highly dependent on
a nontrivial ordering strategy. Whether or not the algorithm can run efficiently in parallel is
determined by the quality of the mesh partitioning, and the goal is to minimize the number of
interface nodes in each partition. Even assuming perfect partitions exist, the number of
partitions and the parallelizability are negatively related. Unless the problem size is large
enough, employing too many processors might actually slow down the computations.
Figures and Tables
Figure 2 Cuthill-McKee Ordering
This figure demonstrates the Cuthill-McKee algorithm’s ability to reduce bandwidth in matrices.
In the second matrix, the nonzero elements are rearranged close to the main diagonal, unlike
those scattered apart in the first matrix. The mesh should be viewed as a cylinder, where the
bottom gray row “wraps around” to the top row.
Figure 3 Cuthill-McKee Starting Node
To reach the smallest bandwidth using the Cuthill-McKee algorithm, sometimes we do not want to
start at a node of lowest degree.
Figure 4 Finite Element Meshes
two_hole_0, two_hole_4, four_hole_0, four_hole_3,
cross_dom_0, cross_dom_4, two_dom_0, two_dom_5
Figure 5 ILU(0) and ILUT
These figures compare the sparsity patterns of several ILU implementations. ILU(0) and ILUT
with three thresholds are applied to the same matrix. The incomplete factor U is superimposed
in red over the original matrix in blue.
ILU(0) ILUT(0.1)
ILUT(0.01) ILUT(0.001)
Figure 6 ILU Experiments
(a) 2d
t Precond nnz
ILUT
0.02 7.3877 529
0.01 3.2354 670
(b) two_hole_0
t Precond nnz
ILUT
0.02 12.249 3985
0.01 7.3715 4862
(c) two_hole_1
t Precond nnz
ILUT
0.02 17.477 6159
0.01 9.3389 7396
(d) two_hole_2
t Precond nnz
ILUT
0.02 41.508 11106
0.01 21.438 13761
Figure 7 CM and RCM
Natural ordering
Cuthill-McKee ordering, where nonzero entries line up “horizontally” in the upper triangle.
Reverse Cuthill-McKee ordering, where nonzero entries line up “vertically” in the upper triangle.
Figure 8 “Umbrella Regions”
Although the Cuthill-McKee and Reverse Cuthill-McKee algorithms are very similar, the matrices
they generate have dramatically different “umbrella regions.” Therefore, incomplete LU
factorizations behave rather differently on them.
Figure 9 Natural v. RCM Ordering on ILUT
Matrices from Figure 6 are reordered with the Reverse Cuthill-McKee ordering, and ILUT
preconditioning is applied with the same thresholds. The condition numbers and sparsity
numbers of the natural-ordering matrices are subtracted from those of the RCM-ordered matrices.
The differences are plotted below.
Figure 10 Mesh Partitioning
Figure 11 Mesh Partitioning for Parallel ILU
Panels (a) through (h)
Figure 12 Parallel ILU
Panels (a) through (d)
Figure 13 Mesh Partitioning Test 1
2 Processors 4 Processors 16 Processors
Figure 14 Mesh Partitioning Test 2
Panels (a) through (d)
Table 1 CM v. RCM and ILU(0) v. ILUT
Table 1.1 2d
Table 1.2 3d
Table 1.3 two_hole_0
Table 1.4 two_hole_1
Table 1.5 two_hole_2
Table 1.6 two_hole_3
Table 1.7 two_hole_4
Table 1.8 four_hole_0
Table 1.9 four_hole_1
Table 1.10 four_hole_2
Table 1.11 four_hole_3
Table 1.12 cross_dom_0
Table 1.13 cross_dom_1
Table 1.14 cross_dom_2
Table 1.15 cross_dom_3
Table 1.16 cross_dom_4
Table 1.17 two_dom_0
Table 1.18 two_dom_1
Table 1.19 two_dom_2
Table 1.20 two_dom_3
Table 1.21 two_dom_4
Table 1.22 two_dom_5
Table 2 15 Ordering Schemes and Their Effects on GMRES Iterations
Table 2.1
two_hole_2
ILU(0) 18 22 18 21 20 29 16 34 20 21 19 28 16 32 24
ILUT(1.0e-1) 21 23 21 22 22 28 26 33 29 22 21 26 22 30 22
ILUT(7.5e-2) 20 21 20 21 20 25 21 27 24 21 20 23 22 25 20
ILUT(5.0e-2) 17 16 16 16 16 20 19 20 22 16 15 18 20 17 19
ILUT(2.5e-2) 12 13 13 13 12 14 13 14 14 13 12 12 14 10 12
ILUT(1.0e-2) 10 9 10 10 9 9 10 8 10 10 9 9 10 9 9
ILUT(7.5e-3) 9 9 9 9 8 8 9 8 9 9 9 8 9 9 9
ILUT(5.0e-3) 8 7 7 8 7 7 8 7 8 8 8 7 8 7 7
ILUT(2.5e-3) 7 6 6 6 6 5 7 6 6 6 6 6 7 6 6
ILUT(1.0e-3) 5 5 5 5 5 4 5 4 5 5 5 5 5 4 5
two_hole_3
ILU(0) 25 32 29 31 30 45 24 47 26 27 24 37 23 46 34
ILUT(1.0e-1) 30 32 30 31 30 42 38 50 42 27 26 36 32 43 33
ILUT(7.5e-2) 29 30 30 29 27 39 30 40 34 27 26 35 30 35 28
ILUT(5.0e-2) 25 23 23 23 22 32 28 31 32 22 19 27 27 22 24
ILUT(2.5e-2) 19 18 18 18 18 21 19 26 19 19 18 17 20 14 18
ILUT(1.0e-2) 14 13 14 13 14 12 15 11 14 13 14 12 13 12 13
ILUT(7.5e-3) 13 12 13 11 13 11 13 11 12 12 13 12 13 12 12
ILUT(5.0e-3) 11 11 11 11 11 9 12 10 10 10 11 9 12 10 11
ILUT(2.5e-3) 10 9 10 9 9 7 10 7 9 8 9 8 10 8 9
ILUT(1.0e-3) 7 6 7 6 7 6 7 6 7 6 7 6 7 6 6
Table 2.2
four_hole_2
ILU(0) 14 16 13 16 15 21 12 24 15 16 13 19 11 23 18
ILUT(1.0e-1) 17 18 17 18 17 23 20 24 21 18 16 20 17 22 18
ILUT(7.5e-2) 14 15 14 15 15 20 17 21 19 16 15 17 16 19 16
ILUT(5.0e-2) 13 12 12 12 12 14 15 15 16 12 12 13 13 13 13
ILUT(2.5e-2) 9 9 9 10 9 11 10 11 11 10 9 10 10 9 10
ILUT(1.0e-2) 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
ILUT(7.5e-3) 7 7 6 7 6 6 6 6 6 7 6 6 7 6 6
ILUT(5.0e-3) 6 6 6 6 6 6 6 5 5 6 6 6 6 6 5
ILUT(2.5e-3) 5 5 5 5 5 4 5 4 5 5 5 5 5 4 5
ILUT(1.0e-3) 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4
four_hole_3
ILU(0) 20 22 19 22 21 29 17 35 21 22 20 27 17 33 26
ILUT(1.0e-1) 23 24 23 24 23 31 28 37 30 24 23 28 24 32 25
ILUT(7.5e-2) 22 21 21 22 20 27 23 28 26 22 21 23 23 25 22
ILUT(5.0e-2) 19 17 18 17 17 24 19 23 22 18 18 19 20 18 19
ILUT(2.5e-2) 14 14 13 14 13 15 14 18 15 14 13 14 13 12 14
ILUT(1.0e-2) 10 9 10 9 10 9 10 9 10 10 10 10 10 9 9
ILUT(7.5e-3) 9 9 9 9 9 9 9 8 9 9 9 9 9 9 9
ILUT(5.0e-3) 8 7 8 7 8 7 8 7 7 7 8 7 8 7 8
ILUT(2.5e-3) 7 6 7 6 7 6 7 5 6 6 7 6 7 6 6
ILUT(1.0e-3) 5 4 5 5 5 4 5 4 5 4 5 4 5 4 5
Table 2.3
Table 2.4
Table 3 Mesh Partitioning Tests for Parallel ILU
Table 3.1
two_hole_0
Processor    Nodes    Interface Nodes    % Interface Nodes
Two Processors
Processor 1 208 27 12.98%
Processor 2 207 31 14.98%
Total 415 58 13.98%
Four Processors
Processor 1 104 32 30.77%
Processor 2 104 27 25.96%
Processor 3 103 31 30.10%
Processor 4 104 29 27.88%
Total 415 119 28.67%
Sixteen Processors
Processor 1 26 6 23.08%
Processor 2 26 13 50.00%
Processor 3 26 18 69.23%
Processor 4 26 21 80.77%
Processor 5 26 20 76.92%
Processor 6 26 13 50.00%
Processor 7 26 22 84.62%
Processor 8 26 17 65.38%
Processor 9 26 17 65.38%
Processor 10 26 23 88.46%
Processor 11 25 15 60.00%
Processor 12 26 13 50.00%
Processor 13 26 22 84.62%
Processor 14 26 17 65.38%
Processor 15 26 18 69.23%
Processor 16 26 16 61.54%
Total 415 271 65.30%
Table 3.2
two_hole_2
Processor    Nodes    Interface Nodes    % Interface Nodes
Two Processors
Processor 1 558 44 7.89%
Processor 2 557 52 9.34%
Total 1115 96 8.61%
Four Processors
Processor 1 279 49 17.56%
Processor 2 279 45 16.13%
Processor 3 279 52 18.64%
Processor 4 278 46 16.55%
Total 1115 192 17.22%
Sixteen Processors
Processor 1 69 14 20.29%
Processor 2 70 31 44.29%
Processor 3 70 24 34.29%
Processor 4 70 39 55.71%
Processor 5 70 34 48.57%
Processor 6 69 28 40.58%
Processor 7 70 36 51.43%
Processor 8 70 27 38.57%
Processor 9 69 33 47.83%
Processor 10 70 36 51.43%
Processor 11 70 25 35.71%
Processor 12 70 34 48.57%
Processor 13 70 51 72.86%
Processor 14 69 34 49.28%
Processor 15 69 33 47.83%
Processor 16 70 21 30.00%
Total 1115 500 44.84%
Table 3.3
two_hole_4
Processor    Nodes    Interface Nodes    % Interface Nodes
Two Processors
Processor 1 5298 165 3.11%
Processor 2 5299 147 2.77%
Total 10597 312 2.94%
Four Processors
Processor 1 2649 193 7.29%
Processor 2 2649 121 4.57%
Processor 3 2649 121 4.57%
Processor 4 2650 179 6.75%
Total 10597 614 5.79%
Sixteen Processors
Processor 1 663 44 6.64%
Processor 2 662 103 15.56%
Processor 3 662 95 14.35%
Processor 4 662 144 21.75%
Processor 5 662 93 14.05%
Processor 6 663 86 12.97%
Processor 7 662 123 18.58%
Processor 8 662 112 16.92%
Processor 9 662 112 16.92%
Processor 10 663 120 18.10%
Processor 11 662 73 11.03%
Processor 12 662 113 17.07%
Processor 13 663 203 30.62%
Processor 14 662 100 15.11%
Processor 15 663 115 17.35%
Processor 16 662 75 11.33%
Total 10597 1711 16.15%
References
[1] W. Arnoldi. "The Principle of Minimized Iterations in the Solution of the Matrix
Eigenvalue Problem." Quarterly of Applied Mathematics. Vol. 9 (1951), pp. 17-29.
[2] S. Balay, W. Gropp, L. McInnes, and B. Smith. The Portable, Extensible Toolkit for
Scientific Computing (PETSc). Version 2.2.1, Code and Documentation, 2004.
Online. Available 4/18/2005: http://www.mcs.anl.gov/petsc.
[3] E. R. Barnes. “An Algorithm for Partitioning the Nodes of a Graph.” SIAM Journal on
Algebraic and Discrete Methods. Vol. 3 (1982), No. 4, pp. 541-550.
[4] S. C. Brenner and L. R. Scott. The Mathematical Theory of Finite Element Methods.
Second Edition. Springer. New York 2002.
[7] E. Cuthill and J. McKee. “Reducing the Bandwidth of Sparse Symmetric Matrices.”
Naval Ship Research and Development Center. ACM/CSC-ER Proceedings of the
1969 24th National Conference, pp. 157-172.
[8] R. Diestel. Graph Theory. Graduate Texts in Mathematics. Springer. New York 2000.
[10] K. A. Gallivan, A. Sameh, and Z. Zlatev. "A Parallel Hybrid Sparse Linear System
Solver." Computing Systems in Engineering. Vol. 1 (1990), pp. 183-195.
[11] A. George. "Computer Implementation of the Finite Element Method." Technical Report
STAN-CS-208, Stanford University, Stanford, CA, 1971.
[12] G. Havas and C. Ramsay. "Breadth-First Search and the Andrews-Curtis Conjecture."
International Journal of Algebra and Computation. Vol. 13 (2003), No. 1, pp. 61-68.
[13] D. Hysom and A. Pothen. “Level-based Incomplete LU Factorization: Graph Model and
Algorithms.” Submitted to SIAM Journal on Matrix Analysis and Applications.
November 2002.
[14] M. Jones and P. Plassmann. "Scalable Iterative Solution of Sparse Linear Systems."
Parallel Computing. Vol. 20 (1994), pp. 753-773.
[15] G. Karypis and V. Kumar. “A Fast and High Quality Multilevel Scheme for Partitioning
Irregular Graphs.” SIAM Journal on Scientific Computing. Vol. 20 (1998), No. 1,
pp. 359-392.
[18] C. T. Kelley. Iterative Methods for Linear and Nonlinear Equations. SIAM Frontiers in
Applied Mathematics. Philadelphia 1995.
[19] M. Luby. "A Simple Parallel Algorithm for the Maximal Independent Set Problem."
SIAM Journal on Computing. Vol. 15 (1986), pp. 1036-1053.
[21] MathWorks, Inc., The. MATLAB. Version 7.0 (R14). Code and Documentation, 2004.
[22] J. Meijerink and H. van der Vorst. "An Iterative Solution Method for Linear Systems of
Which the Coefficient Matrix is a Symmetric M-matrix." Mathematics of
Computation. Vol. 31 (1977), pp. 148-162.
[25] L. Oliker, X. Li, P. Husbands, and R. Biswas. “Effects of Ordering Strategies and
Programming Paradigms on Sparse Matrix Computations.” SIAM Review. Vol. 44
(2002), No. 3, pp. 373-393.
[26] A. Pothen and C. J. Fan. “Computing the Block Triangular Form of a Sparse Matrix.”
ACM Transactions on Mathematical Software, Vol. 16 (1990), No. 4, pp 303-324.
[27] Y. Saad. "ILUT: A Dual Threshold Incomplete LU Factorization." Numerical Linear
Algebra with Applications. Vol. 1 (1994), pp. 387-402.
[28] Y. Saad and M. Schultz. "GMRES: A Generalized Minimal Residual Algorithm for Solving
Nonsymmetric Linear Systems." SIAM Journal on Scientific and Statistical
Computing. Vol. 7 (1986), pp. 856-869.
[29] G. Strang and G. Fix. An Analysis of the Finite Element Method. Prentice Hall 1973.
[30] E. Weisstein, et al. "Jacobi Method." MathWorld - A Wolfram Web Resource. Online.
Available 4/19/2005: http://mathworld.wolfram.com/JacobiMethod.html
[33] J. Zhang. “A Multilevel Dual Reordering Strategy for Robust Incomplete LU Factorization
of Indefinite Matrices.” SIAM Journal on Matrix Analysis and Applications. Vol.
22 (2001), No. 3, pp. 925-947.
Vita
Peter S. Hou was born in Taipei, Taiwan on December 28th, 1981. He grew up like any
ordinary kid who loved watching cartoons and disliked math. However, as if fate had spoken, he
was chosen to be the math teacher’s assistant for five consecutive years. He moved to the
United States in 1997 and attended Langley High School in McLean, Virginia. His interest in
math grew as he participated in many math-related activities and brought home numerous awards.
In 2000, he was accepted into Virginia Tech to study Computer Science. One day he had a
strange feeling about a life without any more math classes, so he started to double-major
in Applied and Discrete Mathematics.
Outside of classes, he enjoyed the challenges from the Putnam Math Competition, Virginia
Tech Regional Math Contest, and Mathematical Contest in Modeling. During school, he
tutored part-time at the Math Emporium. Between semesters, he worked for ProfitScience,
LLC in McLean, Virginia as a software developer. He joined the Tae Kwon Do Club and the
Math Club, quickly became one of the leaders and has remained active to this day. In addition,
he served as a webmaster for the Class Program, helped prepare for the Ring Dance, and
participated in a number of other community activities.
In May 2004, one year after he was given the honor of Phi Beta Kappa membership, he
earned his two B.S. degrees in Computer Science and Mathematics, as well as a black belt in
Chung Do Kwan Tae Kwon Do. Right afterwards, he continued his studies at Virginia Tech as
a Master’s student in Mathematics. Under the Five-Year Bachelor-Master program and the
guidance of Dr. Jeff Borggaard, he will be completing his degree in May 2005. Following
his graduation, he is set to join Mercer Human Resources Consulting in New York City as an
actuary.