-- Cost of setup,
-- Reduction in the number of iterations,
-- Number of flops per iteration,
-- Flop rate of the solve performed in each iteration (a simple cost model combining these is sketched below).
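These factors combine into a rough model of total solution time, which is how we read the Mflop figures in the tables that follow. The sketch below is only an illustration of that model; all input values in it (setup time, iteration count, and so on) are arbitrary examples, not measurements from any of the packages.

    /* Rough cost model for a preconditioned iterative solve; a sketch only,
       with made-up example inputs, not code from any package tested here. */
    #include <stdio.h>

    int main(void)
    {
      double setup_s        = 0.5;     /* cost of setup (seconds), assumed     */
      double iters          = 100.0;   /* iterations after preconditioning     */
      double flops_per_iter = 2.0e6;   /* flops per iteration                  */
      double mflop_rate     = 20.0;    /* flop rate of the solve, in Mflop/s   */

      double solve_s = iters * flops_per_iter / (mflop_rate * 1.0e6);
      double total_s = setup_s + solve_s;

      printf("solve %.2f s, total %.2f s\n", solve_s, total_s);
      printf("aggregate rate %.1f Mflop/s\n",
             iters * flops_per_iter / (total_s * 1.0e6));
      return 0;
    }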
Nala: Sun Ultra-Sparc 2200 running Solaris 5.5.1; 200 MHz, 16 KB L1 cache, 1 MB L2 cache. Compilers: f77 -O5 and gcc -O2.
Cetus lab: Sun Ultra-Sparc 1 workstations, 143 MHz, connected by 10 Mbps Ethernet.
N | Mfl (MSR) | Mfl (VBR, nb=4) |
2500 | 23 | - |
2744 | - | 23 |
10,000 | 20 | - |
9261 | - | 22 |
22,500 | 19 | 22 |
Table 1: Aztec performance on Nala (section 5.1).
N | np=1 | np=2 | np=4 | np=8 |
2500 | 26 | 20 | 18 | 16 |
10,000 | 20 | 26 | 37 | 45 |
22,500 | 19 | 27 | 49 | 68 |
90,000 | 17 | 26 | 50 | 89 |
250,000 | 16 | 25 | 49 | 95 |
Table 2: Aztec aggregate Mflop rating for Jacobi-preconditioned CG
on the Cetus lab (section 5.1).
Aztec's built-in timing and flop counting do not support the ILU(0) preconditioner, so we added that ourselves. The flop count is approximate, but it does not overestimate by more than a few percent. We omit the N = 400 tests because they were too short for the timer to be reliable.
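To give an idea of the magnitudes involved, the following sketch shows one plausible way to approximate the per-iteration flop count for ILU(0)-preconditioned CG on the five-point Laplacian. It is only an illustration of the kind of estimate meant above, not the counter we added to Aztec, and the problem size in it is just an example.

    /* Approximate per-iteration flop count for ILU(0)-preconditioned CG on
       the five-point Laplacian; an illustrative estimate, not Aztec code. */
    #include <stdio.h>

    int main(void)
    {
      long N   = 22500;        /* number of unknowns (grid points), example    */
      long nnz = 5 * N;        /* five-point stencil: about 5 nonzeros per row */

      long matvec  = 2 * nnz;  /* one multiply and one add per nonzero         */
      long precond = 2 * nnz;  /* forward + backward ILU(0) solves, roughly
                                  the same sparsity pattern as the matrix      */
      long vecops  = 10 * N;   /* two dot products and three axpy operations   */

      long per_iter = matvec + precond + vecops;
      printf("approx. %ld flops per iteration for N = %ld\n", per_iter, N);
      return 0;
    }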
From table 2 we see that for small problem sizes the communication overhead dominates; for larger problems the performance seems to level off at about 13 Mfl per processor, roughly 10 percent of peak performance. Performance of the ILU(0)-preconditioned method (table 3) is slightly lower. The explanation for this is not immediately clear; note that, since we used a regular grid problem, it is not due to indirect-addressing overhead.
N | np=1 | np=2 | np=4 | np=8 |
2500 | 17 | 19 | 19 | 17 |
10,000 | 15 | 21 | 35 | 47 |
22,500 | 14 | 21 | 38 | 65 |
90,000 | 13 | 20 | 39 | 73 |
250,000 | - | 20 | 38 | 76 |
Table 3: Aztec aggregate Mflop rating for ILU(0)-preconditioned CG
on the Cetus lab (section 5.1).
N | p=1 | p=2 | p=4 |
400 | 5.6 | 8.8 | 4.5 |
2500 | 5.5 | 2.4 | 2.4 |
10,000 | 5.5 | 3.7 | 4.6 |
90,000 | 5.0 | 5.3 | 8.4 |
250,000 | 4.8 | 5.5 | 9.5 |
Table 4: BlockSolve95 aggregate megaflop rates
on the Cetus lab (section 5.1); one equation per grid point.
From table 4 we see that the performance of BlockSolve95 is lower than that of the other packages reported here. This is probably due to its more general data format and the resulting indirect-addressing overhead. The results in table 5 show that, with inode/clique identification, BlockSolve95 can achieve performance comparable to that of the other packages on regular grid problems.
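The contrast below, our own sketch rather than BlockSolve95 code, shows where the indirect-addressing overhead comes from and why identifying rows with identical sparsity structure (inodes/cliques) helps: in the blocked kernel one indirect load of the input vector is reused for a whole block of rows, and the inner update becomes a small dense operation.

    /* Scalar CSR product: one indirect load of x per nonzero. */
    void csr_matvec(int n, const int *rowptr, const int *colind,
                    const double *val, const double *x, double *y)
    {
      for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
          sum += val[k] * x[colind[k]];
        y[i] = sum;
      }
    }

    /* Blocked (inode/clique-style) product with block size b: the column
       indices, and hence the indirect loads of x, are shared by all b rows
       of a block; values for one shared column are stored contiguously. */
    void inode_matvec(int nblocks, int b, const int *blkptr, const int *colind,
                      const double *val, const double *x, double *y)
    {
      for (int i = 0; i < nblocks * b; i++) y[i] = 0.0;
      for (int ib = 0; ib < nblocks; ib++) {
        for (int k = blkptr[ib]; k < blkptr[ib + 1]; k++) {
          double xk = x[colind[k]];            /* one indirect load ...      */
          const double *v = val + (long)k * b; /* ... feeds b dense updates  */
          for (int r = 0; r < b; r++)
            y[ib * b + r] += v[r] * xk;
        }
      }
    }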
Problems larger than those reported led to segmentation faults, probably because of insufficient memory. Occasionally, but not always, BlockSolve95 aborts with an 'Out of memory' error.
N | p=1 | p=2 | p=4 | p=8 |
400 | 23(10) | 10(2) | 5(2) | 6(2) |
2500 | 19(9) | 20(6) | 17(5) | 24(5) |
10,000 | 18(8) | 25(7.5) | 38(9) | 54(10) |
Table 5: BlockSolve95 aggregate megaflop rates on the Cetus lab (section
5.1); five equations per grid point;
parenthesized results are without inode/clique isolation.
N | memory allocated (MB) | Mfl (CRS) | Mfl (diagonal) |
400 | .05 | 19 | 1 |
2500 | .3 | 20 | 8 |
10,000 | 1.2 | 17 | 14 |
22,500 | 2.8 | 16 | 15 |
Table 6: Megaflop rates for Itpack on a single Cetus machine (section 5.1).
N | p=1 | p=2 | p=4 | p=8 |
400 | 17 | 4 | 2 | 1 |
2500 | 18 | 12 | 8 | 7 |
10,000 | 15 | 20 | 20 | 24 |
90,000 | 13 | 22 | 44 | 75 |
250,000 | 13 | 22 | 44 | 88 |
Table 7: Aggregate megaflop rates for unpreconditioned CG under Petsc on the Cetus lab (section 5.1).
Certain Itpack files are provided only in single precision. We took the single-precision files and compiled them with f77 -r8 -i4, which makes REALs 8 bytes and INTEGERs 4 bytes. It is not clear why diagonal storage only gives good performance on larger problems.
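For reference, a matrix-vector product in diagonal storage has the shape sketched below (our own illustration, not Itpack code): the work consists of a few long, unit-stride loops over full-length diagonals, which is at least consistent with the format needing larger problems before it pays off.

    /* Product in diagonal (DIA) storage: the matrix is kept as ndiag full
       diagonals of length n, entry (i, i+offset[d]) stored at diag[d*n+i]. */
    void dia_matvec(int n, int ndiag, const int *offset,
                    const double *diag, const double *x, double *y)
    {
      for (int i = 0; i < n; i++) y[i] = 0.0;
      for (int d = 0; d < ndiag; d++) {
        int off = offset[d];
        int lo  = off < 0 ? -off : 0;        /* clip diagonal to the matrix  */
        int hi  = off > 0 ? n - off : n;
        for (int i = lo; i < hi; i++)        /* long, unit-stride loop       */
          y[i] += diag[(long)d * n + i] * x[i + off];
      }
    }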
We wrote our own main program to generate the five-point Laplacian matrix. The method in table 7 is an unpreconditioned CG algorithm.
We tested the efficacy of ILU by specifying

    PCSetType(pc,PCSOR);
    PCSORSetSymmetric(pc,SOR_LOCAL_SYMMETRIC_SWEEP);

which corresponds to a block Jacobi method with a local SSOR solve on each processor. This method, reported in table 8, has slightly lower performance than the unpreconditioned method, probably due to the larger fraction of indirect-addressing operations.
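For concreteness, the sketch below shows the overall shape of such a driver: assembling the five-point Laplacian on an n x n grid and solving it with CG under PETSc. It is written against the current KSP interface (PetscCall, MatSetValue, and so on) rather than the PETSc version used for our tests, and the grid size and right-hand side are arbitrary example values.

    /* Sketch of a five-point Laplacian CG driver under PETSc; illustrative
       only, not the test program used for tables 7 and 8. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      PetscInt n = 100, N, Istart, Iend, row;
      Mat      A;
      Vec      x, b;
      KSP      ksp;
      PC       pc;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      N = n * n;

      PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
      PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
      PetscCall(MatSetFromOptions(A));
      PetscCall(MatSetUp(A));
      PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
      for (row = Istart; row < Iend; row++) {          /* five-point stencil */
        PetscInt i = row / n, j = row % n;
        PetscCall(MatSetValue(A, row, row, 4.0, INSERT_VALUES));
        if (i > 0)     PetscCall(MatSetValue(A, row, row - n, -1.0, INSERT_VALUES));
        if (i < n - 1) PetscCall(MatSetValue(A, row, row + n, -1.0, INSERT_VALUES));
        if (j > 0)     PetscCall(MatSetValue(A, row, row - 1, -1.0, INSERT_VALUES));
        if (j < n - 1) PetscCall(MatSetValue(A, row, row + 1, -1.0, INSERT_VALUES));
      }
      PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
      PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

      PetscCall(MatCreateVecs(A, &x, &b));
      PetscCall(VecSet(b, 1.0));

      PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
      PetscCall(KSPSetOperators(ksp, A, A));
      PetscCall(KSPSetType(ksp, KSPCG));
      PetscCall(KSPGetPC(ksp, &pc));
      PetscCall(PCSetType(pc, PCNONE));     /* unpreconditioned CG (table 7) */
      /* For the variant of table 8, replace the line above with:
           PCSetType(pc, PCSOR);
           PCSORSetSymmetric(pc, SOR_LOCAL_SYMMETRIC_SWEEP);                 */
      PetscCall(KSPSetFromOptions(ksp));
      PetscCall(KSPSolve(ksp, b, x));

      PetscCall(KSPDestroy(&ksp));
      PetscCall(VecDestroy(&x));
      PetscCall(VecDestroy(&b));
      PetscCall(MatDestroy(&A));
      PetscCall(PetscFinalize());
      return 0;
    }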
Larger problem sizes ran into what looks like a memory overwrite; attempts to allocate more storage failed.
N | p=1 | p=2 | p=4 | p=8 |
400 | 14 | 2 | 2 | 1 |
2500 | 15 | 9 | 7 | 6 |
10,000 | 12 | 13 | 18 | 20 |
90,000 | 10 | 13 | 26 | 45 |
250,000 | - | 14 | 27 | 52 |
Table 8: Aggregate megaflop rates for ILU CG under Petsc on the Cetus lab (section 5.1).
N | p=1 | p=2 |
400 | 29 | 10 |
2500 | 26 | 5 |
Table 9: Aggregate megaflop rates for PSparsLib on the Cetus lab (section 5.1).