-- Cost of setup,
-- Reduction in the number of iterations,
-- Number of flops per iteration,
-- Flop rate of the solve performed in each iteration (a simple cost model combining these is sketched below).
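These factors combine into a rough model of total solution time, which is how we read the Mflop figures in the tables that follow. The sketch below is only an illustration of that model; all input values in it (setup time, iteration count, and so on) are arbitrary examples, not measurements from any of the packages.

    /* Rough cost model for a preconditioned iterative solve; a sketch only,
       with made-up example inputs, not code from any package tested here. */
    #include <stdio.h>

    int main(void)
    {
      double setup_s        = 0.5;     /* cost of setup (seconds), assumed     */
      double iters          = 100.0;   /* iterations after preconditioning     */
      double flops_per_iter = 2.0e6;   /* flops per iteration                  */
      double mflop_rate     = 20.0;    /* flop rate of the solve, in Mflop/s   */

      double solve_s = iters * flops_per_iter / (mflop_rate * 1.0e6);
      double total_s = setup_s + solve_s;

      printf("solve %.2f s, total %.2f s\n", solve_s, total_s);
      printf("aggregate rate %.1f Mflop/s\n",
             iters * flops_per_iter / (total_s * 1.0e6));
      return 0;
    }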
Nala: Sun Ultra-Sparc 2200 running Solaris 5.5.1; 200 MHz, 16 KB L1 cache, 1 MB L2 cache. Compilers: f77 -O5 and gcc -O2.
Cetus lab: Sun Ultra-Sparc 1 workstations, 143 MHz, connected by 10 Mbps Ethernet.
N | Mfl (MSR) | Mfl (VBR, nb=4) |
2500 | 23 | - |
2744 | - | 23 |
10,000 | 20 | - |
9261 | - | 22 |
22,500 | 19 | 22 |
Table 1: Aztec performance on Nala (section 5.1).
N | np=1 | np=2 | np=4 | np=8 |
2500 | 26 | 20 | 18 | 16 |
10,000 | 20 | 26 | 37 | 45 |
22,500 | 19 | 27 | 49 | 68 |
90,000 | 17 | 26 | 50 | 89 |
250,000 | 16 | 25 | 49 | 95 |
Table 2: Aztec aggregate Mflop rating for Jacobi-preconditioned CG
on the Cetus lab (section 5.1).
Aztec's built-in timing and flop counting do not support the ILU(0) preconditioner, so we added that ourselves. The flop count is approximate, but it does not overestimate by more than a few percent. We omit the N = 400 tests because they were too short for the timer to be reliable.
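To give an idea of the magnitudes involved, the following sketch shows one plausible way to approximate the per-iteration flop count for ILU(0)-preconditioned CG on the five-point Laplacian. It is only an illustration of the kind of estimate meant above, not the counter we added to Aztec, and the problem size in it is just an example.

    /* Approximate per-iteration flop count for ILU(0)-preconditioned CG on
       the five-point Laplacian; an illustrative estimate, not Aztec code. */
    #include <stdio.h>

    int main(void)
    {
      long N   = 22500;        /* number of unknowns (grid points), example    */
      long nnz = 5 * N;        /* five-point stencil: about 5 nonzeros per row */

      long matvec  = 2 * nnz;  /* one multiply and one add per nonzero         */
      long precond = 2 * nnz;  /* forward + backward ILU(0) solves, roughly
                                  the same sparsity pattern as the matrix      */
      long vecops  = 10 * N;   /* two dot products and three axpy operations   */

      long per_iter = matvec + precond + vecops;
      printf("approx. %ld flops per iteration for N = %ld\n", per_iter, N);
      return 0;
    }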
From table 2 we see that for small problem sizes the communication overhead dominates; for larger problems the performance seems to level off at about 13 Mfl per processor, roughly 10 percent of peak performance. Performance of the ILU(0)-preconditioned method (table 3) is slightly lower. The explanation for this is not immediately clear; note that, since we used a regular grid problem, it is not due to indirect-addressing overhead.
N | np=1 | np=2 | np=4 | np=8 |
2500 | 17 | 19 | 19 | 17 |
10,000 | 15 | 21 | 35 | 47 |
22,500 | 14 | 21 | 38 | 65 |
90,000 | 13 | 20 | 39 | 73 |
250,000 | - | 20 | 38 | 76 |
Table 3: Aztec aggregate Mflop rating for ILU(0)-preconditioned CG
on the Cetus lab (section 5.1).
N | p=1 | p=2 | p=4 |
400 | 5.6 | 8.8 | 4.5 |
2500 | 5.5 | 2.4 | 2.4 |
10,000 | 5.5 | 3.7 | 4.6 |
90,000 | 5.0 | 5.3 | 8.4 |
250,000 | 4.8 | 5.5 | 9.5 |
Table 4: BlockSolve95 aggregate megaflop rates
on the Cetus lab (section 5.1); one equation per grid point.
From table 4 we see that the performance of BlockSolve95 is lower than that of the other packages reported here. This is probably due to its more general data format and the resulting indirect-addressing overhead. The results in table 5 show that, with inode/clique identification, BlockSolve95 can achieve performance comparable to that of the other packages on regular grid problems.
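The contrast below, our own sketch rather than BlockSolve95 code, shows where the indirect-addressing overhead comes from and why identifying rows with identical sparsity structure (inodes/cliques) helps: in the blocked kernel one indirect load of the input vector is reused for a whole block of rows, and the inner update becomes a small dense operation.

    /* Scalar CSR product: one indirect load of x per nonzero. */
    void csr_matvec(int n, const int *rowptr, const int *colind,
                    const double *val, const double *x, double *y)
    {
      for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
          sum += val[k] * x[colind[k]];
        y[i] = sum;
      }
    }

    /* Blocked (inode/clique-style) product with block size b: the column
       indices, and hence the indirect loads of x, are shared by all b rows
       of a block; values for one shared column are stored contiguously. */
    void inode_matvec(int nblocks, int b, const int *blkptr, const int *colind,
                      const double *val, const double *x, double *y)
    {
      for (int i = 0; i < nblocks * b; i++) y[i] = 0.0;
      for (int ib = 0; ib < nblocks; ib++) {
        for (int k = blkptr[ib]; k < blkptr[ib + 1]; k++) {
          double xk = x[colind[k]];            /* one indirect load ...      */
          const double *v = val + (long)k * b; /* ... feeds b dense updates  */
          for (int r = 0; r < b; r++)
            y[ib * b + r] += v[r] * xk;
        }
      }
    }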
Problems larger than those reported led to segmentation faults, probably because of insufficient memory. Occasionally, but not always, BlockSolve95 aborts with an 'Out of memory' error.
N | p=1 | p=2 | p=4 | p=8 |
400 | 23(10) | 10(2) | 5(2) | 6(2) |
2500 | 19(9) | 20(6) | 17(5) | 24(5) |
10,000 | 18(8) | 25(7.5) | 38(9) | 54(10) |
Table 5: BlockSolve95 aggregate megaflop rates on the Cetus lab (section
5.1); five equations per grid point;
parenthesized results are without inode/clique isolation.
N | memory allocated (MB) | Mfl (CRS) | Mfl (diagonal) |
400 | .05 | 19 | 1 |
2500 | .3 | 20 | 8 |
10,000 | 1.2 | 17 | 14 |
22,500 | 2.8 | 16 | 15 |
Table 6: Megaflop rates for Itpack on a single Cetus machine (section 5.1).
N | p=1 | p=2 | p=4 | p=8 |
400 | 17 | 4 | 2 | 1 |
2500 | 18 | 12 | 8 | 7 |
10,000 | 15 | 20 | 20 | 24 |
90,000 | 13 | 22 | 44 | 75 |
250,000 | 13 | 22 | 44 | 88 |
Table 7: Aggregate megaflop rates for unpreconditioned CG under Petsc on the Cetus lab (section 5.1).
Certain Itpack files are provided only in single precision. We took the single-precision files and compiled them with f77 -r8 -i4, which makes REALs 8 bytes and INTEGERs 4 bytes. It is not clear why diagonal storage only gives good performance on larger problems.
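For reference, a matrix-vector product in diagonal storage has the shape sketched below (our own illustration, not Itpack code): the work consists of a few long, unit-stride loops over full-length diagonals, which is at least consistent with the format needing larger problems before it pays off.

    /* Product in diagonal (DIA) storage: the matrix is kept as ndiag full
       diagonals of length n, entry (i, i+offset[d]) stored at diag[d*n+i]. */
    void dia_matvec(int n, int ndiag, const int *offset,
                    const double *diag, const double *x, double *y)
    {
      for (int i = 0; i < n; i++) y[i] = 0.0;
      for (int d = 0; d < ndiag; d++) {
        int off = offset[d];
        int lo  = off < 0 ? -off : 0;        /* clip diagonal to the matrix  */
        int hi  = off > 0 ? n - off : n;
        for (int i = lo; i < hi; i++)        /* long, unit-stride loop       */
          y[i] += diag[(long)d * n + i] * x[i + off];
      }
    }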
We wrote our own main program to generate the five-point Laplacian matrix. The method in table 7 is an unpreconditioned CG algorithm.
We tested the efficacy of ILU by specifying

    PCSetType(pc,PCSOR);
    PCSORSetSymmetric(pc,SOR_LOCAL_SYMMETRIC_SWEEP);

which corresponds to a block Jacobi method with a local SSOR solve on each processor. This method, reported in table 8, has slightly lower performance than the unpreconditioned method, probably due to the larger fraction of indirect-addressing operations.
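For concreteness, the sketch below shows the overall shape of such a driver: assembling the five-point Laplacian on an n x n grid and solving it with CG under PETSc. It is written against the current KSP interface (PetscCall, MatSetValue, and so on) rather than the PETSc version used for our tests, and the grid size and right-hand side are arbitrary example values.

    /* Sketch of a five-point Laplacian CG driver under PETSc; illustrative
       only, not the test program used for tables 7 and 8. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      PetscInt n = 100, N, Istart, Iend, row;
      Mat      A;
      Vec      x, b;
      KSP      ksp;
      PC       pc;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      N = n * n;

      PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
      PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
      PetscCall(MatSetFromOptions(A));
      PetscCall(MatSetUp(A));
      PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
      for (row = Istart; row < Iend; row++) {          /* five-point stencil */
        PetscInt i = row / n, j = row % n;
        PetscCall(MatSetValue(A, row, row, 4.0, INSERT_VALUES));
        if (i > 0)     PetscCall(MatSetValue(A, row, row - n, -1.0, INSERT_VALUES));
        if (i < n - 1) PetscCall(MatSetValue(A, row, row + n, -1.0, INSERT_VALUES));
        if (j > 0)     PetscCall(MatSetValue(A, row, row - 1, -1.0, INSERT_VALUES));
        if (j < n - 1) PetscCall(MatSetValue(A, row, row + 1, -1.0, INSERT_VALUES));
      }
      PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
      PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

      PetscCall(MatCreateVecs(A, &x, &b));
      PetscCall(VecSet(b, 1.0));

      PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
      PetscCall(KSPSetOperators(ksp, A, A));
      PetscCall(KSPSetType(ksp, KSPCG));
      PetscCall(KSPGetPC(ksp, &pc));
      PetscCall(PCSetType(pc, PCNONE));     /* unpreconditioned CG (table 7) */
      /* For the variant of table 8, replace the line above with:
           PCSetType(pc, PCSOR);
           PCSORSetSymmetric(pc, SOR_LOCAL_SYMMETRIC_SWEEP);                 */
      PetscCall(KSPSetFromOptions(ksp));
      PetscCall(KSPSolve(ksp, b, x));

      PetscCall(KSPDestroy(&ksp));
      PetscCall(VecDestroy(&x));
      PetscCall(VecDestroy(&b));
      PetscCall(MatDestroy(&A));
      PetscCall(PetscFinalize());
      return 0;
    }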
Larger problem sizes ran into what looks like a memory overwrite; attempts to allocate more storage failed.
N | p=1 | p=2 | p=4 | p=8 |
400 | 14 | 2 | 2 | 1 |
2500 | 15 | 9 | 7 | 6 |
10,000 | 12 | 13 | 18 | 20 |
90,000 | 10 | 13 | 26 | 45 |
250,000 | - | 14 | 27 | 52 |
Table 8: Aggregate megaflop rates for ILU CG under Petsc on the Cetus lab (section 5.1).
N | p=1 | p=2 |
400 | 29 | 10 |
2500 | 26 | 5 |
Table 9: Aggregate megaflop rates for PSparsLib on the Cetus lab (section 5.1).