From: Richard Miller Date: Wed Nov 2 11:41:18 1994 Here are benchmark figures for the Oxford BSP Library Version 1.1, measured on 27.10.94. 'Native' versions of the library were tested on each of the parallel machines below: RS6K - cluster of IBM RS6000's on a token-ring SP1 - IBM SP1 with high-speed switch SGI - Silicon Graphics Power Challenge T3D - Cray T3D In the table below, S - measured sustainable processor speed (in megaflop/sec) P - number of processors L - superstep synchronisation cost (in flop) GInf - asymptotic communication cost for a permutation (in flop/word) NHalf - 'half-bandwidth' data object size (in word) -- i.e. the minimum number of words per remote store operation to achieve half the asymptotic bandwidth. If each bsp_store communicates n words, the effective G should be GInf * (1 + NHalf/n) . Machine S P L GInf NHalf ----------------------------------------------------- RS6K 17 2 67000 150 13 4 220000 300 SP1 25 2 7400 58 24 4 13000 59 8 18000 60 SGI 75 2 1500 9 88 4 2100 9 T3D 10 2 270 1.1 20 4 230 1.2 8 220 1.3 16 260 1.3 ... 512 260 1.6 1024 260 1.7 Benchmark of AMBER/BSP (18.10.94) ================================= The test problem is one I have used before on smaller (<80 PE's) MIMD systems. It is really too small for an MPP of >100 PE's, but it's all I could get my hands on during the one week I had access to the 1024-PE T3D. A much bigger problem would produce much better parallel speedups on the T3D. Description of problem: . AMBER molecular dynamics (100 MD steps) on a polysaccharide in water . 10,704 atoms . nonbonded pair cutoff of 10 Angstroms (=> 2,376,022 pairs) . SHAKE bond-length constraint applied to all bonds Results on Cray T3D =================== Total elapsed time for the run in seconds (excluding disk I/O): #PE's Time Speedup 1 2239.92 1.0 2 1165.10 1.9 4 543.69 4.1 <-- Superlinear speedup results from 8 276.61 8.1 <-- better cache-hit ratio 16 142.95 15.7 32 76.53 29.3 64 44.30 50.6 128 27.27 82.1 256 19.92 112.4 Y/MP 223.9 --- <-- CPU time for same problem on Cray Y/MP To test the effect of an increase in problem size, I ran the same problem with a nonbonded pair cutoff of 15 Angstroms (=> 7,924,551 pairs). Since the larger problem would not fit on a single PE, the "Speedup" figures below are relative to an estimate of single-processor runtime. Total elapsed time for increased problem size (excluding disk I/O): #PE's Time Speedup 8 871.33 8.0 16 449.48 15.5 32 229.28 30.4 64 120.16 58.0 128 65.38 106.6 256 38.24 182.3 512 28.62 243.6 1024 21.40 325.7 Y/MP 602.9 --- <-- CPU time for same problem on Cray Y/MP Results on other BSP machines ============================= Because the parallelisation of AMBER was based on the architecture-independent Oxford BSP library, it was possible to run the program unchanged on other parallel machines at Oxford Parallel: RS6K - a cluster of IBM RS6000's on a token-ring SP1 - an IBM SP1 with a high-speed switch SGI - a shared-memory Silicon Graphics Power Challenge Elapsed time for the original problem on these machines: Machine #PE's Time Speedup RS6K 1 1600.67 1 2 925.73 1.7 4 663.16 2.4 SP1 1 1077.35 1 2 586.55 1.8 4 326.26 3.3 8 202.47 5.3 SGI 1 431.86 1 2 221.97 1.9 4 115.58 3.7 -------------------------------------------------------------------------------- Date: Tue, 29 Aug 95 15:54:44 +0100 From: P.H.Welch@ukc.ac.uk To: Kevin.Parrott@comlab.oxford.ac.uk, Richard.Miller@comlab.ox.ac.uk, Subject: BSP figures? Dear Kevin/Richard, Heather Liddell forwarded to me a note from Richard Miller (2/11/95, that Kevin had forwarded to her) about BSP performance on a number of architectures, including the Edinburgh T3D. Please could you help me fill in a couple of gaps in my understanding? > ---------- Begin Forwarded Message ---------- > >From Richard.Miller@comlab Wed Nov 2 11:41:18 1994 > > Here are benchmark figures for the Oxford BSP Library Version 1.1, > measured on 27.10.94. 'Native' versions of the library were tested on > each of the parallel machines below: > > RS6K - cluster of IBM RS6000's on a token-ring > SP1 - IBM SP1 with high-speed switch > SGI - Silicon Graphics Power Challenge > T3D - Cray T3D > > In the table below, > > S - measured sustainable processor speed (in megaflop/sec) > P - number of processors > L - superstep synchronisation cost (in flop) > GInf - asymptotic communication cost for a permutation (in flop/word) > NHalf - 'half-bandwidth' data object size (in word) -- i.e. the minimum > number of words per remote store operation to achieve half the > asymptotic bandwidth. If each bsp_store communicates n words, > the effective G should be GInf * (1 + NHalf/n) . > > > Machine S P L GInf NHalf > ----------------------------------------------------- > RS6K 17 2 67000 150 13 > 4 220000 300 > > SP1 25 2 7400 58 24 > 4 13000 59 > 8 18000 60 > > SGI 75 2 1500 9 88 > 4 2100 9 > > T3D 10 2 270 1.1 20 > 4 230 1.2 > 8 220 1.3 > 16 260 1.3 > ... > 512 260 1.6 > 1024 260 1.7 > > ----------- End Forwarded Message ----------- I don't understand the `S' column - e.g. why is it there? The other columns give fundamental costs of the infrastructure that are not dependant on application. The `S' column gives only one figure for all values of `P' - what is this? It's defined as the `measured' sustainable processor speed ... on what program is it measured? For example, 10 megaflop/sec. for a single T3D processor is very slow compared to around 50 megaflop/sec. obtained per processor as reported by NASA for their `Embarassingly Parallel' benchmark?? Also, GInf is the `... cost for a permutation ...'. Should that be `... cost for a communication ...'? I'm afraid my BSP is not very good - I don't know the relevance of `permutation' here! > Benchmark of AMBER/BSP (18.10.94) > ================================= > > ... > Description of problem: > > . AMBER molecular dynamics (100 MD steps) on a polysaccharide in water > . 10,704 atoms > . nonbonded pair cutoff of 10 Angstroms (=> 2,376,022 pairs) > . SHAKE bond-length constraint applied to all bonds > > > Results on Cray T3D > =================== > > Total elapsed time for the run in seconds (excluding disk I/O): > > #PE's Time Speedup > 1 2239.92 1.0 > 2 1165.10 1.9 > 4 543.69 4.1 <-- Superlinear speedup results from > 8 276.61 8.1 <-- better cache-hit ratio > 16 142.95 15.7 > 32 76.53 29.3 > 64 44.30 50.6 > 128 27.27 82.1 > 256 19.92 112.4 > Y/MP 223.9 --- <-- CPU time for same problem on Cray Y/MP Can you tell me how many floating-point operations are executed here? I'm afraid I want to calculate absolute performance as well as speedup. I understand that the Oxford BSP library at present allows only one process (i.e. superstep) per processor ... is that right? If so, do not Valiant's predictions about scalability require `sufficient' parallel slackness - i.e. many more processes than processors?? May that explain low efficiencies that may be present? > To test the effect of an increase in problem size, I ran the same problem > with a nonbonded pair cutoff of 15 Angstroms (=> 7,924,551 pairs). > Since the larger problem would not fit on a single PE, the "Speedup" figures > below are relative to an estimate of single-processor runtime. > > Total elapsed time for increased problem size (excluding disk I/O): > > #PE's Time Speedup > 8 871.33 8.0 > 16 449.48 15.5 > 32 229.28 30.4 > 64 120.16 58.0 > 128 65.38 106.6 > 256 38.24 182.3 > 512 28.62 243.6 > 1024 21.40 325.7 > Y/MP 602.9 --- <-- CPU time for same problem on Cray Y/MP Also, how many floating-point operations are executed here? If you can spare the time to put me straight on these things, I'd be very grateful. Many thanks, Peter Welch. -------------------------------------------------------------------------------- Date: Fri, 1 Sep 1995 20:20:08 GMT From: Richard Miller Subject: Re: BSP figures? In reply to Peter Welch's BSP library questions: I don't understand the `S' column - e.g. why is it there? The other columns give fundamental costs of the infrastructure that are not dependant on application. Valiant's original BSP papers used only three architectural performance parameters: L, g and p. For the purpose of complexity analysis of parallel algorithms, it is conventional to express the time costs of synchronisation and communication in terms of 'basic computation steps', instead of in seconds: on a machine with synchronisation time L, each processor could perform L local computation steps in the time it takes to perform a global synchronisation. Of course the nature of a 'basic computation step' is application dependent. That's not a problem if we are just proving asymptotic efficiency of algorithms on abstract machines. But if we want to use the model to predict the performance of implemented algorithms on real hardware, we need to know what the units of measurement are. I expect that's why Bill McColl introduced a fourth parameter, S, which allows us to relate computation steps to absolute time. The `S' column gives only one figure for all values of `P' - what is this? It's the computation speed (in megaflop/sec) of each processor. (There's a tacit assumption that all processors run at the same speed.) It's defined as the `measured' sustainable processor speed ... on what program is it measured? It's measured by performing a dot-product of two long vectors. (Where "long" means big enough to overflow the machine's first-level cache.) For example, 10 megaflop/sec. for a single T3D processor is very slow compared to around 50 megaflop/sec. obtained per processor as reported by NASA for their `Embarassingly Parallel' benchmark?? The dot-product-megaflop/sec number is just my arbitrary choice, for the purpose of providing a benchmark. (In fact I believe it's a fairly good predictor of achievable speed on real applications, based on my experience and conversations with other users as well as people at Cray...) But if you measure an actual computation rate of S' megaflop/sec on your own application, then you will want to multiply the published L and g by S'/S when doing BSP cost calculations. Also, GInf is the `... cost for a permutation ...'. Should that be `... cost for a communication ...'? I'm afraid my BSP is not very good - I don't know the relevance of `permutation' here! Sorry, it means a global communication in which each processor is sending data to one other processor, and each processor is receiving from one other: i.e. the list of receivers is a permutation of the list of senders. > Benchmark of AMBER/BSP (18.10.94) > ================================= > > ... Can you tell me how many floating-point operations are executed here? I'm afraid I want to calculate absolute performance as well as speedup. I have a record of the number of floating-point operations executed on the Y/MP (as reported by the Cray profiling software -- I can't swear to its accuracy). But it's buried on a tape somewhere ... give me a few days to dig it out. I understand that the Oxford BSP library at present allows only one process (i.e. superstep) per processor ... is that right? If so, do not Valiant's predictions about scalability require `sufficient' parallel slackness - i.e. many more processes than processors?? May that explain low efficiencies that may be present? The UNIX-based implementations of the library allow only one BSP process per UNIX process -- i.e. no multithreading -- but on most versions it is possible to start multiple UNIX processes on each processor. (This is architecture-dependent, though: you can't do it on the SP1 if you use the high-speed switch directly, and you can't do it on the T3D at all). However the performance parameters are only valid for one process per processor: synchronisation and communication within a processor are (usually but not always!) cheaper than between processors. The Oxford library's asychronous assignments constitute a kind of parallel slackness: if you look at the formal semantics of the programming model underlying the library (in my DPhil thesis), you will see that the fetch and store operations become separate CSP processes which run in parallel with the invoking process. If each process can do enough (log p?) fetches or stores in a superstep, the overlap of communication and computation should give enough latency hiding for efficient operation. ... If you can spare the time to put me straight on these things, I'd be very grateful. Many thanks, Peter Welch. I'll try to get the AMBER numbers to you shortly. -- Richard Miller -------------------------------------------------------------------------------- Date: Tue, 5 Sep 1995 15:56:12 GMT From: Richard Miller Subject: Re: BSP/AMBER benchmark flop count Further to Peter Welch's questions about the BSP/AMBER benchmark, my records show that the Cray Y/MP reported about 25.5*10^9 flops for the whole run (with the 10,704-atom example). This is equivalent to the following absolute megaflop rates for the benchmarked machines: CPU rate CPU rate Nodes per node parallel RS6000 4 16 39 Y/MP 1 114 114 SP1 8 24 126 SGI 4 59 222 T3D 256 11 1286 Note that the single-node speeds are pretty close to my published 'S' values (obtained by a simple-minded dot-product measurement), except in the case of the SGI Power Challenge (S=75, AMBER speed=59) -------------------------------------------------------------------------------- Date: Thu, 7 Sep 95 18:52:08 BST From: Kevin.Parrott@comlab.ox.ac.uk Subject: Re: HPC/Eficiency workshop Dear Peter, I am sorry but I have commitments that mean that I cannot make yoour meeting. I noted the correspondence with Richar Miller and hope that you are in the picture as far as the technical aspects of BSP is concerned. Some additional comments of my own. I believe BSP is a good solution to part of the pain of parallel programming in that it provides a simple discipline which encourages the aggregation of remote memory access that is necessary for most applications to be efficient on parallel systems. Our BSP implementation does not make use of the randomisation and parallel slackness that Valiant describes in his original paper; however many scientific algorithms have a BSP character in that the aggregation I mentioned can be explicitly organised by the programmer. One can also predict with reasonable confidence how an application will perform on a given system by making use of the BSP cost model (provided this makes use of the application`s floating point performance rather than the dot product mflop rate that Richards uses to calibrate the BSP parameters). However most of the world is message passing right now and it will be some time before that changes. It is quite feasible (and commonly practiced) to organise message passing so that aggregation is exploited; the performance problem arises with efficient implementation of the message passing library itself. That is still a problem on the T3D and on other virtual/actual shared memory systems. Obviously if you make the problem size sufficiently large you can neglect that communication inefficiency -- but not everyone can sensibly do that. It is regrettable that a number of groups have moved over to using the Cray Shmem library directly since that is of course non-portable. The BSP library is quite close to Shmem performance and is portable. The exciting aspect of the library is when you look at the native implementations - they are very simple and efficient across an amazingly wide range of architectures. In other works BSP comes close to providing a portable way of programming in the native parallel language of most parallel systems (no exceptions as yet). regards Kevin. ============================================================== Dr. Kevin Parrott, Numerical Analysis Group Oxford University Computing Laboratory Wolfson Building, Parks Road, Oxford, OX1 3QD, UK. email: kevin.parrott@comlab.ox.ac.uk / tel:+44-865-273889 fax: +44-865-273839 ==============================================================