From: Richard Miller <Richard.Miller@comlab.oxford.ac.uk>
Date: Wed Nov  2 11:41:18 1994

Here are benchmark figures for the Oxford BSP Library Version 1.1,
measured on 27.10.94.  'Native' versions of the library were tested on
each of the parallel machines below:

  RS6K  - cluster of IBM RS6000's on a token-ring
  SP1   - IBM SP1 with high-speed switch
  SGI   - Silicon Graphics Power Challenge
  T3D   - Cray T3D

In the table below,

  S     - measured sustainable processor speed (in megaflop/sec)
  P     - number of processors
  L     - superstep synchronisation cost (in flop)
  GInf  - asymptotic communication cost for a permutation (in flop/word)
  NHalf - 'half-bandwidth' data object size (in word) -- i.e. the minimum
	  number of words per remote store operation to achieve half the
	  asymptotic bandwidth.  If each bsp_store communicates n words,
	  the effective G should be GInf * (1 + NHalf/n) .


Machine         S       P         L      GInf   NHalf
-----------------------------------------------------
RS6K           17       2       67000   150      13
			4      220000   300

SP1            25       2        7400    58      24
			4       13000    59
			8       18000    60

SGI            75       2        1500     9      88
			4        2100     9

T3D            10       2         270     1.1    20
			4         230     1.2
			8         220     1.3
		       16         260     1.3
		      ...
		      512         260     1.6
		     1024         260     1.7


Benchmark of AMBER/BSP (18.10.94)
=================================

The test problem is one I have used before on smaller (<80 PE's) MIMD
systems.  It is really too small for an MPP of >100 PE's, but it's all
I could get my hands on during the one week I had access to the 1024-PE T3D.
A much bigger problem would produce much better parallel speedups on
the T3D.

Description of problem:

  . AMBER molecular dynamics (100 MD steps) on a polysaccharide in water
  . 10,704 atoms
  . nonbonded pair cutoff of 10 Angstroms (=> 2,376,022 pairs)
  . SHAKE bond-length constraint applied to all bonds

Results on Cray T3D
===================

Total elapsed time for the run in seconds (excluding disk I/O):

#PE's   Time     Speedup
  1   2239.92      1.0
  2   1165.10      1.9
  4    543.69      4.1      <-- Superlinear speedup results from
  8    276.61      8.1      <--  better cache-hit ratio
 16    142.95     15.7
 32     76.53     29.3
 64     44.30     50.6
128     27.27     82.1
256     19.92    112.4
Y/MP   223.9      ---       <-- CPU time for same problem on Cray Y/MP

To test the effect of an increase in problem size, I ran the same problem
with a nonbonded pair cutoff of 15 Angstroms (=> 7,924,551 pairs).
Since the larger problem would not fit on a single PE, the "Speedup" figures
below are relative to an estimate of single-processor runtime.

Total elapsed time for increased problem size (excluding disk I/O):

#PE's   Time     Speedup
   8   871.33      8.0
  16   449.48     15.5
  32   229.28     30.4
  64   120.16     58.0
 128    65.38    106.6
 256    38.24    182.3
 512    28.62    243.6
1024    21.40    325.7
Y/MP   602.9      ---       <-- CPU time for same problem on Cray Y/MP

Results on other BSP machines
=============================

Because the parallelisation of AMBER was based on the 
architecture-independent
Oxford BSP library, it was possible to run the program unchanged on other
parallel machines at Oxford Parallel:

   RS6K    - a cluster of IBM RS6000's on a token-ring
   SP1     - an IBM SP1 with a high-speed switch
   SGI     - a shared-memory Silicon Graphics Power Challenge

Elapsed time for the original problem on these machines:

Machine    #PE's   Time     Speedup

RS6K         1    1600.67      1
	     2     925.73      1.7
	     4     663.16      2.4

SP1          1    1077.35      1
	     2     586.55      1.8
	     4     326.26      3.3
	     8     202.47      5.3

SGI          1     431.86      1
	     2     221.97      1.9
	     4     115.58      3.7
--------------------------------------------------------------------------------


Date: Tue, 29 Aug 95 15:54:44 +0100
From: P.H.Welch@ukc.ac.uk
To: Kevin.Parrott@comlab.oxford.ac.uk, Richard.Miller@comlab.ox.ac.uk, 
Subject: BSP figures?


Dear Kevin/Richard,

Heather Liddell forwarded to me a note from Richard Miller (2/11/95,
that Kevin had forwarded to her) about BSP performance on a number of
architectures, including the Edinburgh T3D.  Please could you help
me fill in a couple of gaps in my understanding?


> ---------- Begin Forwarded Message ----------
> >From Richard.Miller@comlab Wed Nov  2 11:41:18 1994
> 
> Here are benchmark figures for the Oxford BSP Library Version 1.1,
> measured on 27.10.94.  'Native' versions of the library were tested on
> each of the parallel machines below:
> 
>   RS6K  - cluster of IBM RS6000's on a token-ring
>   SP1   - IBM SP1 with high-speed switch
>   SGI   - Silicon Graphics Power Challenge
>   T3D   - Cray T3D
> 
> In the table below,
> 
>   S     - measured sustainable processor speed (in megaflop/sec)
>   P     - number of processors
>   L     - superstep synchronisation cost (in flop)
>   GInf  - asymptotic communication cost for a permutation (in flop/word)
>   NHalf - 'half-bandwidth' data object size (in word) -- i.e. the minimum
> 	  number of words per remote store operation to achieve half the
> 	  asymptotic bandwidth.  If each bsp_store communicates n words,
> 	  the effective G should be GInf * (1 + NHalf/n) .
> 
> 
> Machine         S       P         L      GInf   NHalf
> -----------------------------------------------------
> RS6K           17       2       67000   150      13
> 			  4      220000   300
> 
> SP1            25       2        7400    58      24
> 			  4       13000    59
> 			  8       18000    60
> 
> SGI            75       2        1500     9      88
> 			  4        2100     9
> 
> T3D            10       2         270     1.1    20
> 			  4         230     1.2
> 			  8         220     1.3
> 		         16         260     1.3
> 		        ...
> 		        512         260     1.6
> 		       1024         260     1.7
> 
> ----------- End Forwarded Message -----------


I don't understand the `S' column - e.g. why is it there?  The other
columns give fundamental costs of the infrastructure that are not
dependant on application.  The `S' column gives only one figure for
all values of `P' - what is this?  It's defined as the `measured'
sustainable processor speed ... on what program is it measured?
For example, 10 megaflop/sec. for a single T3D processor is very
slow compared to around 50 megaflop/sec. obtained per processor as
reported by NASA for their `Embarassingly Parallel' benchmark??

Also, GInf is the `... cost for a permutation ...'.  Should that
be `... cost for a communication ...'?  I'm afraid my BSP is not
very good - I don't know the relevance of `permutation' here!


> Benchmark of AMBER/BSP (18.10.94)
> =================================
> 
> ...

> Description of problem:
>
>   . AMBER molecular dynamics (100 MD steps) on a polysaccharide in water
>   . 10,704 atoms
>   . nonbonded pair cutoff of 10 Angstroms (=> 2,376,022 pairs)
>   . SHAKE bond-length constraint applied to all bonds
>
> 
> Results on Cray T3D
> ===================
> 
> Total elapsed time for the run in seconds (excluding disk I/O):
> 
> #PE's   Time     Speedup
>   1   2239.92      1.0
>   2   1165.10      1.9
>   4    543.69      4.1      <-- Superlinear speedup results from
>   8    276.61      8.1      <--  better cache-hit ratio
>  16    142.95     15.7
>  32     76.53     29.3
>  64     44.30     50.6
> 128     27.27     82.1
> 256     19.92    112.4
> Y/MP   223.9      ---       <-- CPU time for same problem on Cray Y/MP


Can you tell me how many floating-point operations are executed here?
I'm afraid I want to calculate absolute performance as well as speedup.

I understand that the Oxford BSP library at present allows only one
process (i.e. superstep) per processor ... is that right?  If so,
do not Valiant's predictions about scalability require `sufficient'
parallel slackness - i.e. many more processes than processors??
May that explain low efficiencies that may be present?


> To test the effect of an increase in problem size, I ran the same problem
> with a nonbonded pair cutoff of 15 Angstroms (=> 7,924,551 pairs).
> Since the larger problem would not fit on a single PE, the "Speedup" figures
> below are relative to an estimate of single-processor runtime.
> 
> Total elapsed time for increased problem size (excluding disk I/O):
> 
> #PE's   Time     Speedup
>    8   871.33      8.0
>   16   449.48     15.5
>   32   229.28     30.4
>   64   120.16     58.0
>  128    65.38    106.6
>  256    38.24    182.3
>  512    28.62    243.6
> 1024    21.40    325.7
> Y/MP   602.9      ---       <-- CPU time for same problem on Cray Y/MP


Also, how many floating-point operations are executed here?

If you can spare the time to put me straight on these things, I'd
be very grateful.  Many thanks,

Peter Welch.
--------------------------------------------------------------------------------


Date: Fri, 1 Sep 1995 20:20:08 GMT
From: Richard Miller <Richard.Miller@comlab.oxford.ac.uk>
Subject: Re: BSP figures?


In reply to Peter Welch's BSP library questions:

	I don't understand the `S' column - e.g. why is it there?  The other
	columns give fundamental costs of the infrastructure that are not
	dependant on application.

Valiant's original BSP papers used only three architectural performance
parameters: L, g and p.  For the purpose of complexity analysis of parallel
algorithms, it is conventional to express the time costs of synchronisation
and communication in terms of 'basic computation steps', instead of in
seconds: on a machine with synchronisation time L, each processor could
perform L local computation steps in the time it takes to perform a
global synchronisation.

Of course the nature of a 'basic computation step' is application
dependent.  That's not a problem if we are just proving asymptotic
efficiency of algorithms on abstract machines.  But if we want to use the
model to predict the performance of implemented algorithms on real
hardware, we need to know what the units of measurement are.  I expect
that's why Bill McColl introduced a fourth parameter, S, which allows
us to relate computation steps to absolute time.

	The `S' column gives only one figure for
	all values of `P' - what is this?

It's the computation speed (in megaflop/sec) of each processor.  (There's
a tacit assumption that all processors run at the same speed.)

	It's defined as the `measured'
	sustainable processor speed ... on what program is it measured?

It's measured by performing a dot-product of two long vectors.  (Where
"long" means big enough to overflow the machine's first-level cache.)

	For example, 10 megaflop/sec. for a single T3D processor is very
	slow compared to around 50 megaflop/sec. obtained per processor as
	reported by NASA for their `Embarassingly Parallel' benchmark??

The dot-product-megaflop/sec number is just my arbitrary choice, for the
purpose of providing a benchmark. (In fact I believe it's a fairly good
predictor of achievable speed on real applications, based on my
experience and conversations with other users as well as people at
Cray...) But if you measure an actual computation rate of S'
megaflop/sec on your own application, then you will want to multiply the
published L and g by S'/S when doing BSP cost calculations.

	Also, GInf is the `... cost for a permutation ...'.  Should that
	be `... cost for a communication ...'?  I'm afraid my BSP is not
	very good - I don't know the relevance of `permutation' here!

Sorry, it means a global communication in which each processor is sending
data to one other processor, and each processor is receiving from one other:
i.e. the list of receivers is a permutation of the list of senders.

	> Benchmark of AMBER/BSP (18.10.94)
	> =================================
	>
	> ...

	Can you tell me how many floating-point operations are executed here?
	I'm afraid I want to calculate absolute performance as well as speedup.

I have a record of the number of floating-point operations executed on the
Y/MP (as reported by the Cray profiling software -- I can't swear to its
accuracy).  But it's buried on a tape somewhere ... give me a few days to
dig it out.

	I understand that the Oxford BSP library at present allows only one
	process (i.e. superstep) per processor ... is that right?  If so,
	do not Valiant's predictions about scalability require `sufficient'
	parallel slackness - i.e. many more processes than processors??
	May that explain low efficiencies that may be present?

The UNIX-based implementations of the library allow only one BSP process
per UNIX process -- i.e. no multithreading -- but on most versions it is
possible to start multiple UNIX processes on each processor.  (This is
architecture-dependent, though: you can't do it on the SP1 if you use
the high-speed switch directly, and you can't do it on the T3D at all).
However the performance parameters are only valid for one process per
processor: synchronisation and communication within a processor are
(usually but not always!) cheaper than between processors.

The Oxford library's asychronous assignments constitute a kind of
parallel slackness:  if you look at the formal semantics of the
programming model underlying the library (in my DPhil thesis), you
will see that the fetch and store operations become separate CSP
processes which run in parallel with the invoking process.  If each
process can do enough (log p?) fetches or stores in a superstep, the
overlap of communication and computation should give enough latency
hiding for efficient operation.

	...
	If you can spare the time to put me straight on these things, I'd
	be very grateful.  Many thanks,

	Peter Welch.

I'll try to get the AMBER numbers to you shortly.

-- Richard Miller
--------------------------------------------------------------------------------


Date: Tue, 5 Sep 1995 15:56:12 GMT
From: Richard Miller <Richard.Miller@comlab.ox.ac.uk>
Subject: Re: BSP/AMBER benchmark flop count


Further to Peter Welch's questions about the BSP/AMBER benchmark, my records
show that the Cray Y/MP reported about 25.5*10^9 flops for the whole run
(with the 10,704-atom example).  This is equivalent to the following
absolute megaflop rates for the benchmarked machines:

		CPU rate   CPU rate
	Nodes   per node   parallel

RS6000  4       16         39
Y/MP    1       114        114
SP1     8       24         126
SGI     4       59         222
T3D     256     11         1286

Note that the single-node speeds are pretty close to my published 'S'
values (obtained by a simple-minded dot-product measurement), except
in the case of the SGI Power Challenge (S=75, AMBER speed=59)
--------------------------------------------------------------------------------


Date: Thu, 7 Sep 95 18:52:08 BST
From: Kevin.Parrott@comlab.ox.ac.uk
Subject: Re: HPC/Eficiency workshop


Dear Peter,

I am sorry but I have commitments that mean that I cannot make yoour
meeting. I noted the correspondence with Richar Miller and hope that
you are in the picture as far as the technical aspects of BSP is
concerned. Some additional comments of my own.

I believe BSP is a good solution to part of the pain of parallel
programming in that it provides a simple discipline which encourages
the aggregation of remote memory access that is necessary for most
applications to be efficient on parallel systems. Our BSP
implementation does not make use of the randomisation and parallel
slackness that Valiant describes in his original paper; however many
scientific algorithms have a BSP character in that the aggregation I
mentioned can be explicitly organised by the programmer. One can also
predict with reasonable confidence how an application will perform on
a given system by making use of the BSP cost model (provided this
makes use of the application`s floating point performance rather than
the dot product mflop rate that Richards uses to calibrate the BSP
parameters).

However most of the world is message passing right now and it will be
some time before that changes. It is quite feasible (and commonly
practiced) to organise message passing so that aggregation is
exploited; the performance problem arises with efficient
implementation of the message passing library itself. That is still a
problem on the T3D and on other virtual/actual shared memory
systems. Obviously if you make the problem size sufficiently large you
can neglect that communication inefficiency -- but not everyone can
sensibly do that. It is regrettable that a number of groups have moved
over to using the Cray Shmem library directly since that is of course
non-portable. The BSP library is quite close to Shmem performance and
is portable. The exciting aspect of the library is when you look at
the native implementations - they are very simple and efficient across
an amazingly wide range of architectures. In other works BSP comes
close to providing a portable way of programming in the native
parallel language of most parallel systems (no exceptions as yet).


regards


Kevin.

==============================================================
Dr. Kevin Parrott,
Numerical Analysis Group
Oxford University Computing Laboratory
Wolfson Building, Parks Road, Oxford, OX1 3QD, UK. 
email: kevin.parrott@comlab.ox.ac.uk / tel:+44-865-273889 
fax: +44-865-273839
==============================================================