Overview of Iterative Linear System Solver Packages
Chapter 2 Discussion
Iterative methods are subject to several design decisions that affect the ease
of use of the software and the resulting performance. In this chapter I
give a general discussion of the issues involved, and of how they
are addressed in the packages under review.
2.1 Preconditioners
A good preconditioner is necessary for the convergence of iterative methods
as the problem to be solved becomes more difficult. Good preconditioners
are hard to design, especially in the context of parallel processing.
Here is a short inventory of the various kinds of preconditioners
found in the packages reviewed.
2.1.1 About incomplete factorisation preconditioners
Incomplete factorisations are among the most successful preconditioners
developed for single-processor computers. Unfortunately, since they are
implicit in nature, they cannot immediately be used on parallel architectures.
Most packages therefore concentrate on more parallel methods such as Block
Jacobi or Additive Schwarz. There are a few implementations of multicolour
incomplete factorisations.
BlockSolve95 is the only package that catches breakdown of the ILU or IC
factorisation. Its manual outlines code that repeatedly shifts the diagonal
until the factorisation succeeds.
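As an illustration of that strategy, here is a minimal self-contained sketch of a shift-and-retry loop; this is not BlockSolve95 code, and a small dense Cholesky stands in for the package's incomplete factorisation.

    /* Sketch of the diagonal-shift strategy: if the factorisation breaks
     * down on a non-positive pivot, add a shift to the diagonal and retry.
     * A small dense Cholesky stands in for an IC/ILU factorisation here. */
    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    #define N 3

    /* Dense Cholesky A = L L^T; returns nonzero on breakdown. */
    static int cholesky(const double A[N][N], double L[N][N])
    {
        memcpy(L, A, sizeof(double[N][N]));
        for (int k = 0; k < N; k++) {
            if (L[k][k] <= 0.0) return 1;               /* pivot not positive */
            L[k][k] = sqrt(L[k][k]);
            for (int i = k + 1; i < N; i++) L[i][k] /= L[k][k];
            for (int j = k + 1; j < N; j++)
                for (int i = j; i < N; i++)
                    L[i][j] -= L[i][k] * L[j][k];
        }
        return 0;
    }

    int main(void)
    {
        /* An indefinite matrix, so the first attempts break down. */
        double A[N][N] = {{1, 2, 0}, {2, 1, 0}, {0, 0, 1}};
        double L[N][N];
        double shift = 0.0, total = 0.0;

        while (cholesky(A, L) != 0) {
            shift = (shift == 0.0) ? 0.125 : 2.0 * shift;   /* grow the shift */
            for (int i = 0; i < N; i++) A[i][i] += shift;
            total += shift;
            printf("breakdown; retrying with added diagonal shift %g\n", shift);
        }
        printf("factorisation succeeded after total shift %g\n", total);
        return 0;
    }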
2.1.2 Preconditioners for sequential processing
On sequential architectures, the BPKIT package provides sophisticated block
factorisations, and LASpack contains multigrid solvers. Most other packages
supply the user with variants of ILU (Incomplete LU) or IC (Incomplete
Cholesky).
2.1.3 Preconditioners for parallel iterative methods
The following approaches to parallel preconditioning are taken in the
packages under review here.
Direct approximations of the inverse SPAI (section 3.13)
is the only package that provides a direct approximation to the inverse
of the coefficient matrix. Since such an approximation is applied
directly, using a matrix-vector product, it is trivially parallel. Moreover,
the SPAI preconditioner is itself generated fully in parallel.
Block Jacobi Each processor solves its own subsystem,
and there is no communication between the processors. Since this strategy
neglects the global/implicit properties of the linear system, only a limited
improvement in the number of iterations can result. On the other hand,
this type of method is very parallel.
All parallel preconditioner packages provide some form of Block Jacobi
method.
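In matrix terms (generic notation, not tied to any particular package), with the matrix partitioned into blocks A_{ij} according to the distribution over p processors, Block Jacobi keeps only the diagonal blocks:
\[
M_{BJ} = \begin{pmatrix} A_{11} & & \\ & \ddots & \\ & & A_{pp} \end{pmatrix},
\qquad
M_{BJ}^{-1} r:\quad \text{each processor $i$ solves } A_{ii} z_i = r_i ,
\]
so the local solves proceed without any communication.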
Additive Schwarz As in the Block Jacobi method, each processor
solves a local subsystem. However, the local system is now augmented to
include bordering variables, which belong to other processors. A certain
amount of communication is now necessary, and the convergence improvement
can be much higher.
This method is available in Aztec (3.1), Petsc (3.10), ParPre (3.8),
and PSparselib (3.12).
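In the notation commonly used for this method (again generic, not specific to any of these packages), with R_i the restriction onto the i-th overlapping subdomain, the Additive Schwarz preconditioner is
\[
M_{AS}^{-1} = \sum_{i=1}^{p} R_i^{T} A_i^{-1} R_i ,
\qquad A_i = R_i A R_i^{T};
\]
the overlap in the R_i is what requires communication with neighbouring processors, and also what improves the convergence over Block Jacobi.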
Multicolour factorisations It is possible to perform a global incomplete
factorisation if the domain is not only split over the processors, but
is also augmented with a multicolour structure. Under the reasonable assumption
that each processor has variables of every colour, both the factorisation
and the solution of a system thus ordered are parallel. The number of
synchronisation points is equal to the number of colours.
This is the method supplied in BlockSolve95 (3.2); it is also available
in ParPre (3.8).
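As a two-colour illustration (a red-black ordering; the general multicolour case is analogous), the reordered matrix has the form
\[
P A P^{T} = \begin{pmatrix} D_r & F \\ G & D_b \end{pmatrix},
\]
where D_r and D_b are diagonal because variables of the same colour are not coupled. The triangular solves of a no-fill incomplete factorisation in this ordering then consist of trivially parallel diagonal solves plus one product with an off-diagonal block per colour, which is where the synchronisation points come from.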
Block factorisation It is possible to implement block
SSOR or ILU methods, with the subblocks corresponding to the local systems
(with overlap, this gives the Multiplicative Schwarz method). Such factorisations
are necessarily more sequential than Block Jacobi or Additive Schwarz methods,
but they are also more accurate. With an appropriately chosen processor
ordering (e.g., multicolour instead of sequential) the solution time is
only a small multiple of that of the Block Jacobi and Additive Schwarz methods.
Such block factorisations are available in ParPre (3.8); PSparselib
(3.12) has the Multiplicative Schwarz method, under the name `multicolour
SOR'.
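For reference, a generic block SSOR preconditioner on this partitioning (not the formulation of any particular package) has the form
\[
M = (D + L)\, D^{-1}\, (D + U),
\]
where D is the block diagonal of processor blocks and L and U are the strictly lower and upper block triangles. The solves with D + L and D + U run through the processor blocks in sequence, which is the sequential component mentioned above; a multicolour processor ordering shortens that recursion to the number of colours.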
Multilevel methods Petsc (3.10) and ParPre (3.8) are
the only packages supplying variants of (algebraic) multigrid methods in
parallel.
Schur complement methods ParPre (3.8) and PSparselib
(3.12) contain Schur complement domain decomposition methods.
2.2 Data structure issues
Every iterative method contains a matrix-vector product and a preconditioner
solve. Since these are inextricably tied to the data structures used for
the matrix and the preconditioner (and possibly even to the data structure
used for vectors), perhaps the most important design decision for an iterative
methods package is the choice of the data structure.
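To make this dependence concrete, here is a matrix-vector product for one common storage choice, compressed sparse row (CSR) format; the three arrays are generic and do not reflect the internal format of any particular package.

    /* Sketch: y = A*x for a matrix stored in compressed sparse row (CSR)
     * format: row pointers, column indices, and values. */
    #include <stdio.h>

    static void csr_matvec(int n, const int *rowptr, const int *colind,
                           const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colind[k]];
            y[i] = sum;
        }
    }

    int main(void)
    {
        /* 3x3 tridiagonal example: [2 -1 0; -1 2 -1; 0 -1 2]. */
        int    rowptr[] = {0, 2, 5, 7};
        int    colind[] = {0, 1, 0, 1, 2, 1, 2};
        double val[]    = {2, -1, -1, 2, -1, -1, 2};
        double x[]      = {1, 1, 1}, y[3];

        csr_matvec(3, rowptr, colind, val, x, y);
        printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* prints 1 0 1 */
        return 0;
    }

A package that prescribes or hides such a structure must also own this product routine; a package that accepts arbitrary user structures must instead ask the user for it.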
2.2.1 The interface to user data
The more complicated the data structure, the more urgent the question becomes
of how the user is to supply the structures to the package. This question
is particularly important in a parallel context.
The packages reviewed here take the following approaches
to this problem.
-
Fully internal, not directly accessible, data structures (essentially object-oriented
programming). This approach is taken in Petsc (3.10), ParPre (3.8), and
BlockSolve (3.2). The data structures here are accessible only through
function calls. This means that the package has to supply routines for
constructing the structures, for inspecting them, and for applying them.
-
Prescribed, user-supplied data structures. This approach is taken in Aztec
(3.1); it is available in PCG (3.9). Here the user has to construct the
data structures, but the package supplies the product and solve routines.
-
User supplied, arbitrary, data structures, with user supplied product and
solve routines. This approach is available in PCG (3.9) and Petsc (3.10);
the object-oriented package IML++ (3.5) can also be considered to use this
approach.
-
User defined data structures, not passed to the iterative method at all;
product and solve operations are requested through a reverse communication
interface. This approach is taken in PIM (3.11); it is also available in
PCG (3.9).
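The control flow of a reverse-communication interface can be illustrated with a small self-contained sketch. The request codes, the rc_state structure, and the use of a simple Richardson iteration are all invented for this illustration; the actual PIM and PCG interfaces differ in detail.

    #include <stdio.h>
    #include <math.h>

    #define N 3

    enum rc_request { RC_MATVEC, RC_DONE };

    struct rc_state { int phase; int iter; double r[N]; };

    /* Advance the iteration (here a damped Richardson iteration).  When the
     * routine returns RC_MATVEC, the caller must compute out = A*(*in) with
     * its own data structure and call again; RC_DONE signals convergence. */
    static enum rc_request rc_step(struct rc_state *s, double *x, const double *b,
                                   const double **in, const double *out)
    {
        if (s->phase == 1) {                        /* out now holds A*x       */
            double norm = 0.0;
            for (int i = 0; i < N; i++) {
                s->r[i] = b[i] - out[i];            /* residual r = b - A x    */
                norm += s->r[i] * s->r[i];
            }
            if (sqrt(norm) < 1e-10 || ++s->iter > 1000) return RC_DONE;
            for (int i = 0; i < N; i++) x[i] += 0.4 * s->r[i];  /* damped update */
        }
        s->phase = 1;
        *in = x;                                    /* request A*x from caller */
        return RC_MATVEC;
    }

    int main(void)
    {
        /* The caller owns the matrix in whatever format it likes; a dense
         * 3x3 array serves here. */
        double A[N][N] = {{2, -1, 0}, {-1, 2, -1}, {0, -1, 2}};
        double b[N] = {1, 0, 1}, x[N] = {0, 0, 0}, Ax[N] = {0, 0, 0};
        const double *in;
        struct rc_state s = {0, 0, {0}};

        while (rc_step(&s, x, b, &in, Ax) == RC_MATVEC)
            for (int i = 0; i < N; i++)             /* user-supplied matvec    */
                Ax[i] = A[i][0]*in[0] + A[i][1]*in[1] + A[i][2]*in[2];

        printf("solution: %g %g %g (exact: 1 1 1)\n", x[0], x[1], x[2]);
        return 0;
    }

The point is that the solver never sees the matrix: whenever it needs a product, it returns control to the caller, which applies its own data structure.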
2.2.2 Parallel data layout
There are several ways of partitioning a sparse matrix over multiple processors.
The scheme supported by all packages is partitioning by block rows.
-
Each processor receives a contiguous series of rows of the matrix. This
is the approach taken in Petsc (3.10); it is available in BlockSolve95
(3.2). Under this approach a processor can determine the ownership of any
variable by keeping a short list of the first and last rows of each processor
(a lookup sketched after this list).
-
Each processor receives an arbitrary set of rows. This is the approach
taken in Aztec (3.1); it is available in BlockSolve95 (3.2). Under this
scheme, each processor needs to maintain the mapping of its local rows
to the global numbering. It is now no longer trivial to determine ownership
of a variable.
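A minimal sketch of the ownership lookup under the contiguous block-row partitioning (the array name first_row is illustrative):

    /* Sketch: finding the owner of a global row under a contiguous
     * block-row partitioning.  The only data needed is the array of first
     * rows per processor. */
    #include <stdio.h>

    /* first_row[p] is the first global row on processor p;
     * first_row[nproc] is the total number of rows. */
    static int owner_of(int global_row, const int *first_row, int nproc)
    {
        if (global_row < first_row[0]) return -1;     /* out of range */
        for (int p = 0; p < nproc; p++)
            if (global_row < first_row[p + 1])
                return p;                             /* row lies in block p */
        return -1;                                    /* out of range */
    }

    int main(void)
    {
        int first_row[] = {0, 40, 75, 100};           /* 3 processors, 100 rows */
        printf("row 50 is owned by processor %d\n", owner_of(50, first_row, 3));
        return 0;
    }

Under the arbitrary-rows scheme no such short table exists, which is why ownership queries become nontrivial there.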
When a sparse matrix is distributed, the local matrix on each processor
needs to have its column indices mapped from the global to the local numbering.
Various packages offer support for this renumbering.
2.3 High performance computing
Sparse matrix problems are notoriously low performers. Most of this is
related to the fact that there is little or no data reuse, thereby preventing
the use of BLAS kernels. See for example the tests on vector architectures
in [4].
Other performance problems stem from parallel aspects of the solution
methods.
Here, then, are a few of the approaches taken to alleviate these performance
problems.
2.3.1 Quotient graph computation
A matrix defines a graph by introducing an edge (i, j) for every
nonzero element a_ij. A dense matrix in this manner defines a graph
in which each node is connected to every other node. This is also called
a `clique'. Sparse matrices, on the other hand, induce a graph where, no
matter the number of nodes, each node is connected to only a small number
of other nodes.
However, sparse matrices from problems with multiple physical variables
will have dense subblocks, for instance corresponding to the variables
on any given node. It is possible to increase the efficiency of linear
algebra operations by imposing on the matrix the block structure induced
by these small dense subblocks. For instance, the scalar multiplication/division
$a_{ik}a_{kk}^{-1}a_{ki}$ that appears in Gaussian elimination now becomes
a matrix inversion and a matrix-matrix multiplication, in other words,
BLAS3 kernels.
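In other words (standard elimination notation, not taken from any package's documentation), the elimination update turns from a scalar operation into a block operation,
\[
a_{ij} \leftarrow a_{ij} - a_{ik}\, a_{kk}^{-1}\, a_{kj}
\qquad\longrightarrow\qquad
A_{ij} \leftarrow A_{ij} - A_{ik}\, A_{kk}^{-1}\, A_{kj},
\]
where the capitalised quantities are the small dense subblocks, so the work can be done with dense matrix-matrix kernels.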
Identifying cliques in the matrix graph, and deriving a quotient graph
by `factoring them out', is the approach taken in the BlockSolve package;
section 3.2.
The PCG package (section 3.9) takes the opposite approach in its Regular
Grid Stencil data format. To get dense subblocks, the degrees of freedom
have to be numbered first in regular storage, but in PCG they are numbered
last. This approach is more geared towards vector architectures.
2.3.2 Inner products
Most linear algebra operations in the iterative solution of sparse systems
are easily parallelised. The one exception concerns the inner products
that appear in all iterative methods (with the exception of the Chebyshev
iteration). Since the collection and redistribution of data in an inner
product will inevitably have some processors waiting, the number of inner
products should be kept low.
The conjugate gradients and bi-conjugate gradients algorithms have two
interdependent inner products, and various people have suggested ways to
reduce this to two independent ones, which can then combine their communication
stages. See [3] and the references cited therein.
This approach has been taken in the PIM package; section 3.11.
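The communication-combining idea itself is easy to illustrate: accumulate the local parts of both inner products and reduce them in a single message. The sketch below shows only that combined reduction (plain MPI; the vector names are illustrative), not a full restructured CG iteration.

    #include <mpi.h>
    #include <stdio.h>

    /* Compute <r,r> and <r,w> with a single MPI_Allreduce instead of two,
     * so the processors synchronise only once. */
    static void combined_dots(const double *r, const double *w, int nlocal,
                              double *rr, double *rw, MPI_Comm comm)
    {
        double local[2] = {0.0, 0.0}, global[2];
        for (int i = 0; i < nlocal; i++) {
            local[0] += r[i] * r[i];
            local[1] += r[i] * w[i];
        }
        MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, comm);
        *rr = global[0];
        *rw = global[1];
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double r[2] = {1.0, 2.0}, w[2] = {3.0, 4.0}, rr, rw;
        combined_dots(r, w, 2, &rr, &rw, MPI_COMM_WORLD);
        printf("<r,r> = %g, <r,w> = %g\n", rr, rw);
        MPI_Finalize();
        return 0;
    }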
The second source of inner products is the orthogonalisation stage in
the GMRES algorithm. Using Gram-Schmidt orthogonalisation, the inner products
are independent and can be combined; however, this makes the resulting
algorithm unstable. Modified Gram-Schmidt gives a more stable algorithm,
but the inner products are interdependent, and have to be processed in
sequence. A middle ground is to apply (unmodified) Gram-Schmidt twice.
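The difference between the two strategies can be seen in a small sketch (plain C with illustrative helper routines; in a parallel code each call to dot would end in a global reduction):

    #include <stdio.h>

    static double dot(const double *x, const double *y, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }

    static void axpy(double a, const double *x, double *y, int n)
    {
        for (int i = 0; i < n; i++) y[i] += a * x[i];
    }

    /* Classical Gram-Schmidt: all inner products use the original w, so
     * they are independent and their reductions can be combined. */
    static void classical_gs(double **v, double *w, double *h, int k, int n)
    {
        for (int j = 0; j < k; j++) h[j] = dot(v[j], w, n);
        for (int j = 0; j < k; j++) axpy(-h[j], v[j], w, n);   /* w -= h[j] v[j] */
    }

    /* Modified Gram-Schmidt: each inner product uses the updated w, so the
     * reductions must be processed one after another. */
    static void modified_gs(double **v, double *w, double *h, int k, int n)
    {
        for (int j = 0; j < k; j++) {
            h[j] = dot(v[j], w, n);
            axpy(-h[j], v[j], w, n);
        }
    }

    int main(void)
    {
        /* Orthogonalise w = (1,2,3) against the unit vectors e1, e2. */
        double e1[3] = {1, 0, 0}, e2[3] = {0, 1, 0}, *v[2] = {e1, e2};
        double w1[3] = {1, 2, 3}, w2[3] = {1, 2, 3}, h[2];

        classical_gs(v, w1, h, 2, 3);
        modified_gs(v, w2, h, 2, 3);
        printf("CGS: %g %g %g   MGS: %g %g %g\n",
               w1[0], w1[1], w1[2], w2[0], w2[1], w2[2]);   /* both: 0 0 3 */
        return 0;
    }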
2.4 Language
The languages used to implement the packages here are C, C++, and Fortran.
To some extent the implementation language determines from what language
the library can be called: C++ constructs cannot be used from C, and if
a C routine returns an internally allocated array, this routine cannot
directly be used from Fortran. The Petsc library addresses this last point
in its Fortran interface.