Overview of Iterative Linear System Solver Packages


Chapter 5 Performance tests

We have run performance tests on a number of packages. Ideally, these tests combine all of the following possibilities:

5.1 Machines used

The following machines at the University of Tennessee, Knoxville, were used:

nala: Sun Ultra-Sparc 2200 running Solaris 5.5.1; 200 MHz, 16 KB L1 cache, 1 MB L2 cache. Compilers: f77 -O5 and gcc -O2.

cetus lab: Sun Ultra-Sparc 1 workstations, 143 MHz, connected by 10 Mbps Ethernet.
 
N         Mflop (MSR)   Mflop (VBR, nb=4)
2500          23               -
2744           -              23
10,000        20               -
9261           -              22
22,500        19              22

Table 1: Aztec performance on Nala (section 5.1).

N         np=1   np=2   np=4   np=8
2500        26     20     18     16
10,000      20     26     37     45
22,500      19     27     49     68
90,000      17     26     50     89
250,000     16     25     49     95

Table 2: Aztec aggregate Mflop rating for Jacobi-preconditioned CG
on the Cetus lab (section 5.1).

5.2 Results

5.2.1 Aztec

Problem tested: the five-point Laplacian solved with Jacobi-preconditioned CG. We used the sample main program provided and altered only parameter settings.

We also tested the seven-point Laplacian with 4 variables per grid point, using the VBR format. Since this format uses level-2 BLAS routines, it should in principle be able to reach higher performance, but in practice we do not see this happening. In the single-processor tests in table 1 we see that for small problems there is a slight performance increase due to cache reuse, but not of the order we would see for dense operations; the use of level-2 BLAS in the VBR format seems to have no effect.
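
As a point of reference for table 1, the following sketch (our own illustration, not code from the Aztec distribution) builds the five-point Laplacian in the MSR (modified sparse row) layout used for the scalar tests: the first n entries of val hold the diagonal, val[n] is unused, and bindx[0..n] delimits the off-diagonal entries of each row, whose column indices and values share the same positions in bindx and val. The VBR format adds a blocking of rows and columns on top of a similar pointer structure.

#include <stdlib.h>

/* Five-point Laplacian on an nx-by-ny grid in MSR format (sketch).
   val[0..n-1] holds the diagonal, val[n] is unused, and for row i the
   off-diagonal values and column indices sit in val[k], bindx[k] for
   bindx[i] <= k < bindx[i+1]. */
void laplace5_msr(int nx, int ny, double **val_out, int **bindx_out)
{
    int     n     = nx * ny;
    double *val   = malloc((n + 1 + 4 * n) * sizeof(double));
    int    *bindx = malloc((n + 1 + 4 * n) * sizeof(int));
    int     k     = n + 1;

    bindx[0] = k;
    for (int i = 0; i < n; i++) {
        int ix = i % nx, iy = i / nx;
        val[i] = 4.0;                                          /* diagonal */
        if (ix > 0)      { bindx[k] = i - 1;  val[k++] = -1.0; }
        if (ix < nx - 1) { bindx[k] = i + 1;  val[k++] = -1.0; }
        if (iy > 0)      { bindx[k] = i - nx; val[k++] = -1.0; }
        if (iy < ny - 1) { bindx[k] = i + nx; val[k++] = -1.0; }
        bindx[i + 1] = k;
    }
    *val_out = val;
    *bindx_out = bindx;
}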

Aztec's built-in timing and flop count do not support the ILU(0) preconditioner, so we added that. The flop count is approximate, but it does not overestimate by more than a few percent. We omit the N = 400 tests because they were too short for the timer to be reliable.
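
For orientation, the kind of estimate involved looks as follows (a sketch under generic accounting conventions, not Aztec's internal counter; the function names are ours).

/* Rough per-iteration flop counts for CG on a matrix with n unknowns and
   nnz nonzeros (illustrative only).  One iteration does a matrix-vector
   product (about 2*nnz flops), two inner products and three vector
   updates (about 10*n flops), plus the preconditioner application. */
long cg_jacobi_flops(long n, long nnz)
{
    return 2*nnz + 10*n + n;        /* Jacobi: a diagonal scaling, n flops */
}

long cg_ilu0_flops(long n, long nnz)
{
    return 2*nnz + 10*n + 2*nnz;    /* ILU(0): forward and backward solve, ~2*nnz flops */
}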

From table 2 we see that for small problem sizes the communication overhead dominates; for larger problems the performance seems to level off at about 13 Mflop per processor, roughly 10 percent of peak performance. Performance of the ILU(0)-preconditioned method (table 3) is slightly lower. The explanation for this is not immediately clear; note that, since we used a regular grid problem, it is not due to indirect-addressing overhead.
 
N         np=1   np=2   np=4   np=8
2500        17     19     19     17
10,000      15     21     35     47
22,500      14     21     38     65
90,000      13     20     39     73
250,000      -     20     38     76

Table 3: Aztec aggregate Mflop rating for ILU(0)-preconditioned CG
on the Cetus lab (section 5.1).

N         p=1    p=2    p=4
400       5.6    8.8    4.5
2500      5.5    2.4    2.4
10,000    5.5    3.7    4.6
90,000    5.0    5.3    8.4
250,000   4.8    5.5    9.5

Table 4: BlockSolve95 aggregate megaflop rates
on the Cetus lab (section 5.1); one equation per grid point.

5.2.2 BlockSolve95

We tested the supplied grid5 demo code, with the timing and flop counting data supplied in BlockSolve95. The method was CG preconditioned with ILU.

From table 4 we see that the performance of BlockSolve95 is lower than that of the other packages reported here. This is probably due to its more general data format and the resulting indirect-addressing overhead. The results in table 5 show that, through inode/clique identification, BlockSolve95 can achieve performance comparable to that of regular grid problems in other packages.
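
The effect can be seen in the innermost kernels. The following sketch (hypothetical code for illustration, not BlockSolve95 routines) contrasts a scalar, indirect-addressed row product with the dense block product that becomes possible once rows with identical sparsity structure are grouped into an inode or clique.

/* A scalar CSR row product loads every x entry through an index array: */
void row_scalar(int nnz_row, const double *a, const int *ja,
                const double *x, double *yi)
{
    double s = 0.0;
    for (int k = 0; k < nnz_row; k++)
        s += a[k] * x[ja[k]];            /* one indirect load per nonzero */
    *yi = s;
}

/* Once b rows with identical sparsity structure are grouped into an
   inode/clique, they can be applied as a b-by-ncol dense block: the index
   array is consulted only once per column and each x entry is reused b
   times (y[0..b-1] is assumed to be initialized by the caller). */
void rows_clique(int b, int ncol, const double *blk, const int *ja,
                 const double *x, double *y)
{
    for (int j = 0; j < ncol; j++) {
        double xj = x[ja[j]];
        for (int i = 0; i < b; i++)
            y[i] += blk[i * ncol + j] * xj;
    }
}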

Problems larger than those reported led to segmentation faults, probably because of insufficient memory. Occasionally, but not always, BlockSolve95 aborts with an 'Out of memory' error.
 

5.2.3 Itpack

Problem tested: the five-point Laplacian solved with Jacobi-preconditioned CG. We wrote our own main program to generate the Laplacian matrix in both compressed row and diagonal storage formats.
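
Our driver itself is not reproduced here, but the diagonal storage it generates is easy to sketch (hypothetical code for illustration): the five-point Laplacian on an nx-by-ny grid is held as five arrays of length n = nx*ny, one per diagonal, with offsets -nx, -1, 0, +1, +nx.

#include <stdlib.h>

/* diag[d][i] holds A(i, i+offset[d]); positions that fall outside the
   matrix are simply left at zero. */
void laplace5_dia(int nx, int ny, double *diag[5], int offset[5])
{
    int n      = nx * ny;
    int off[5] = { -nx, -1, 0, 1, nx };

    for (int d = 0; d < 5; d++) {
        offset[d] = off[d];
        diag[d]   = calloc(n, sizeof(double));
    }
    for (int i = 0; i < n; i++) {
        int ix = i % nx, iy = i / nx;
        diag[2][i] = 4.0;                    /* main diagonal            */
        if (iy > 0)      diag[0][i] = -1.0;  /* coupling to point i - nx */
        if (ix > 0)      diag[1][i] = -1.0;  /* coupling to point i - 1  */
        if (ix < nx - 1) diag[3][i] = -1.0;  /* coupling to point i + 1  */
        if (iy < ny - 1) diag[4][i] = -1.0;  /* coupling to point i + nx */
    }
}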
 
 
N         p=1      p=2       p=4      p=8
400       23 (10)  10 (2)     5 (2)    6 (2)
2500      19 (9)   20 (6)    17 (5)   24 (5)
10,000    18 (8)   25 (7.5)  38 (9)   54 (10)

Table 5: BlockSolve95 aggregate megaflop rates on the Cetus lab (section 5.1); five equations per grid point;
parenthesized results are without inode/clique isolation.

N         alloc (MB)   Mflop (CRS)   Mflop (Dia)
400          0.05          19             1
2500         0.3           20             8
10,000       1.2           17            14
22,500       2.8           16            15

Table 6: Megaflop rates for Itpack on a single Cetus machine (section 5.1).

N         p=1    p=2    p=4    p=8
400        17      4      2      1
2500       18     12      8      7
10,000     15     20     20     24
90,000     13     22     44     75
250,000    13     22     44     88

Table 7: Aggregate megaflop rates for unpreconditioned CG under Petsc on the Cetus lab (section 5.1).

Certain Itpack files are provided only in single precision. We took those files and compiled them with f77 -r8 -i4, which makes REALs 8 bytes and INTEGERs 4 bytes. It is not clear why diagonal storage only gives good performance on larger problems.
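
One would expect an advantage for diagonal storage, since its matrix-vector product consists of long unit-stride loops without index arrays, as in the sketch below (our own illustration, not Itpack code); apparently that advantage only materializes once the vectors are long enough.

/* y = A*x for a matrix in diagonal storage: one long unit-stride loop per
   diagonal, with no indirect addressing. */
void matvec_dia(int n, int ndiag, double *diag[], const int *offset,
                const double *x, double *y)
{
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int d = 0; d < ndiag; d++) {
        int o  = offset[d];
        int lo = (o < 0) ? -o : 0;       /* first row whose entry exists */
        int hi = (o > 0) ? n - o : n;    /* one past the last such row   */
        for (int i = lo; i < hi; i++)
            y[i] += diag[d][i] * x[i + o];
    }
}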
 

5.2.4 Petsc

We tested the Petsc library on Sun UltraSparcs that were connected by both Ethernet and an ATM switch. The results below are for the Ethernet connection, but the ATM numbers were practically indistinguishable.

We wrote our own main program to generate the five-point Laplacian matrix. The method in table 7 is an unpreconditioned CG algorithm.

We tested the efficacy of ILU by specifying
PCSetType(pc,PCSOR);
PCSORSetSymmetric(pc,SOR_LOCAL_SYMMETRIC_SWEEP);
which corresponds to a block Jacobi method with a local SSOR solve on each processor. This method, reported in table 8, has slightly lower performance than the unpreconditioned method, probably due to the larger fraction of indirect-addressing operations.
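
For completeness, here is a minimal sketch of how these calls fit into a solver setup, written against the current PETSc KSP interface (the interface current at the time of these tests was organized somewhat differently); the matrix A and the vectors b and x are assumed to be assembled elsewhere, and error checking is omitted.

#include <petscksp.h>

PetscErrorCode solve_ssor_cg(Mat A, Vec b, Vec x)
{
    KSP ksp;
    PC  pc;

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPCG);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCSOR);                               /* block Jacobi over the processes  */
    PCSORSetSymmetric(pc, SOR_LOCAL_SYMMETRIC_SWEEP);   /* local SSOR sweep on each process */
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);
    KSPDestroy(&ksp);
    return 0;
}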
 

5.2.5 PSparsLib

We added flop counting to the example program dd-jac, which implements an additive Schwarz method whose local solve is ILUT-preconditioned GMRES.
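
Schematically, one application of such a preconditioner looks as follows (our own sketch with hypothetical types, not PSparsLib's routines): the residual is restricted to each subdomain, an approximate local solve (in dd-jac, an ILUT-preconditioned GMRES iteration) is performed, and the local corrections are added together.

#include <stdlib.h>

/* One subdomain: its size, the global indices of its unknowns, and an
   approximate local solver. */
typedef struct {
    int        nloc;
    const int *index;
    void (*local_solve)(int nloc, const double *r_loc, double *z_loc);
} Subdomain;

/* z = sum_s R_s^T (approximate A_s^{-1}) R_s r */
void additive_schwarz_apply(int nsub, const Subdomain *sub,
                            int nglobal, const double *r, double *z)
{
    for (int i = 0; i < nglobal; i++) z[i] = 0.0;
    for (int s = 0; s < nsub; s++) {        /* one subdomain per processor in practice */
        int     n  = sub[s].nloc;
        double *rl = malloc(n * sizeof(double));
        double *zl = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) rl[i] = r[sub[s].index[i]];   /* restrict: R_s r        */
        sub[s].local_solve(n, rl, zl);                            /* approximate local solve */
        for (int i = 0; i < n; i++) z[sub[s].index[i]] += zl[i];  /* prolong and add: R_s^T  */
        free(rl);
        free(zl);
    }
}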

Larger problem sizes ran into what looks like a memory overwrite; attempts to allocate more storage failed.
 
N         p=1    p=2    p=4    p=8
400        14      2      2      1
2500       15      9      7      6
10,000     12     13     18     20
90,000     10     13     26     45
250,000     -     14     27     52

Table 8: Aggregate megaflop rates for ILU CG under Petsc on the Cetus lab (section 5.1).

N        p=1    p=2
400        29     10
2500       26      5

Table 9: Aggregate megaflop rates for PSparsLib on the Cetus lab (section 5.1).

5.3 Discussion

Although our tests are nowhere near comprehensive, we can state a few general conclusions.

Copyright © 1998