Newsgroups: comp.lang.fortran,comp.parallel,comp.sys.super
From: marc@efn.org (Marc Baber)
Subject: APR xHPF 2.1 Released; NAS Parallel Benchmark Results
Date: Wed, 5 Jul 1995 20:34:09 GMT

   APR RELEASES xHPF 2.1, WORLD'S FIRST HPF TO TURN IN NAS BENCHMARK RESULTS

                                 July 5, 1995
=========================================================================

Sacramento, CA -- Applied Parallel Research (APR) announced it will begin shipping xHPF 2.1, the latest version of its industry-leading High Performance Fortran (HPF) compilation system. xHPF 2.1 is the first HPF implementation to compile NAS Parallel Benchmark (NPB) programs and has thus set a new standard for end-user-achievable performance on a wide range of parallel platforms.

The NPB suite is used to measure sustainable performance of computer systems when running five computational kernels and three simulated CFD programs. The programs represent typical applications used in NASA's NAS project. The benchmarks are considered a "pathfinder" in searching out the best parallel systems for grand challenge problems such as modeling whole aircraft.

APR's president, John Levesque, said, "xHPF may be the only HPF implementation capable of successfully parallelizing NAS parallel benchmarks today. To date, no other HPF vendor has published even a single result for any of these eight benchmarks. The speed-ups achieved by xHPF on the [Cray] T3D, the [IBM] SP-2, and the [Intel] Paragon are impressive enough that I believe it will be months or even years before other HPF vendors can offer comparable performance."

"We expect 1995 will be the watershed year for parallel programming of distributed memory systems and clusters. Before 1995, hand-tuned message-passing programming was the norm. Beginning this year, automatic parallelization by sophisticated, production-quality HPF compilers will be the norm, and APR's xHPF is well-positioned to become the de facto industry standard for HPF compilation. This is a wake-up call for application programmers who've been waiting since the early days of the hypercubes for good parallel Fortran compilers."

From UNI-C, The Danish Computing Center for Research and Education, Jorgen Moth commented, "Parallelization of standard Fortran programs is made practical for our busy scientists by FORGE Explorer and xHPF. We have found these tools to be a bridge between Fortran 77, Fortran 90, and HPF, thus removing many obstacles from the exploitation of parallel machines."

At the Cornell Theory Center, where the largest IBM SP-2 (512 nodes) is installed, Donna Bergmark summarized over two years of experience with xHPF, saying, "It [APR's xHPF] has proven to be a 'quick and easy' way to get a program to run in parallel, without having to learn a message-passing protocol." She also noted, "At the present time, there are on average 500-800 invocations of xHPF per month [at the CTC]."

With xHPF, automatic parallelization has now reached the point where gains achievable by hand-parallelization are often not cost-effective when the expense of re-programming is factored into the price-performance equation. Nonetheless, for users who demand the very highest performance, APR provides ForgeX, a Motif GUI Fortran code browser and interactive parallelization system that is fully compatible with xHPF. With ForgeX, users can interactively fine-tune their parallelized codes, using their knowledge of the underlying algorithms as well as execution timings obtained with ForgeX's code instrumentation features.
To underscore the importance of the latest release, during the month of July APR is offering free ForgeX licenses, including interactive parallelization for distributed memory systems, to sites purchasing xHPF licenses. The number of concurrent interactive users is tied to the number of processors the xHPF-parallelized codes will be run on. Contact APR for details.

The NAS Benchmark results include the EP, SP, BT, FT, and MG programs. These are slightly modified versions of the standard Fortran-77 programs from NASA, supplemented with HPF directives. While many MPP vendors spent months optimizing the sequential versions of these programs to use cache more effectively, or to perform table lookups for some operations, no similar restructurings were performed on APR's versions. The APR versions of the NAS benchmarks are therefore closer to end-user programs, and the results obtained should be more representative of what the general user community can expect.

The timings in the following tables were obtained using xHPF and APR's shared-memory parallelization system, spf. With these results APR is demonstrating the ability to maintain portable code across varied MPP and SMP parallel systems. All of the benchmarks also run sequentially on a uniprocessor.

The results in the tables following this article are for xHPF 2.1 (APR development version 2029) and, for shared-memory systems, spf. As development versions and new releases of xHPF achieve even better results, APR will update the timings available on its web pages at http://www.infomall.org/apri. APR encourages other HPF vendors to respond in kind by making their HPF benchmark results available on their web pages, accessible from the HPFF (High Performance Fortran Forum) web page at http://www.erc.msstate.edu/hpff/home.html.

The speedups obtained for the Fortran-77 versions of the benchmarks highlight xHPF's ability to parallelize DO loops in addition to Fortran-90 array syntax and HPF FORALL statements (a sketch of this style of annotated source appears below). Some other HPF implementations either do not attempt to parallelize DO loops or lack xHPF's robust dependence analysis, and so fail to parallelize DO loops that pose no problem for xHPF.

These Fortran programs were processed without modification by APR's xHPF, and code was generated for the Cray T3D, IBM SP2, Digital ALPHA cluster, SGI Power Challenge cluster, and Intel PARAGON. Some of the benchmarks were also processed, again without modification, by APR's spf, with code generated for the Sun SPARCcenter 2000.

The IBM SP2 does very well compared to the other MPP systems. APR's timings do come closer to the timings supplied by IBM, but no special "tuning" was done for the SP-2. The superior performance can be attributed to IBM's Fortran-77 compiler (xlf), which compiles the parallelized SPMD Fortran-77 code output by xHPF; xlf is more successful at achieving maximum single-processor performance than the other vendors' Fortran-77 compilers. The scaling of the timings as the number of processors is increased is good on all the platforms.

APR is a leading supplier of software tools for Fortran program analysis, performance measurement, parallelization, restructuring, and dialect translation.
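The release itself does not reproduce any of the annotated benchmark sources. As background only, the following is a minimal, hypothetical sketch of the style it describes; the program, array names, sizes, and BLOCK distribution are invented for illustration. The CHPF$ directives are ordinary comments to a plain Fortran 77 compiler, while an HPF system such as xHPF uses them to distribute the data and generate parallel code.

      PROGRAM RELAX
C     Hypothetical fragment (not from the NAS benchmark sources):
C     a standard Fortran 77 relaxation sweep annotated with HPF
C     data-distribution directives.
      INTEGER N
      PARAMETER (N = 256)
      REAL U(N,N), V(N,N)
CHPF$ DISTRIBUTE U(*,BLOCK)
CHPF$ ALIGN V(I,J) WITH U(I,J)
      INTEGER I, J

C     Initialize the grid.
      DO 5 J = 1, N
        DO 5 I = 1, N
          U(I,J) = REAL(I + J)
    5 CONTINUE

C     A plain DO loop nest, the case emphasized above: there is no
C     parallel syntax here, so dependence analysis must prove the
C     iterations independent before they can be distributed.
      DO 10 J = 2, N - 1
        DO 10 I = 2, N - 1
          V(I,J) = 0.25*(U(I-1,J) + U(I+1,J) + U(I,J-1) + U(I,J+1))
   10 CONTINUE

C     The same update written as an HPF FORALL, which is parallel
C     by construction.
      FORALL (J = 2:N-1, I = 2:N-1)
     &  V(I,J) = 0.25*(U(I-1,J) + U(I+1,J) + U(I,J-1) + U(I,J+1))

      PRINT *, 'V(2,2) = ', V(2,2)
      END

How a given HPF implementation handles the plain DO loop nest, as opposed to the FORALL, is exactly the distinction the preceding paragraph draws.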
======== NAS Benchmark Results ========

NOTE: The following times are between 2 and 10 times slower than the timings reported by the various vendors. The major difference is due to the vendors' extensive rewriting of the benchmarks to obtain the best possible single-node performance. APR has asked, and will continue to ask, the vendors to supply their optimized single-node versions of the benchmarks so everyone can start with the same sequential programs. To date, however, all vendors have refused, saying their versions of the benchmarks are proprietary.

Benchmark SP: Simulated CFD Application
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                          1               7634.  **
---------------------------------------------------------------------
 Cray T3D                         16               2368.
                                  32               1353.
                                  64                728.
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                576.
                                  32                320.
                                  64                192.
---------------------------------------------------------------------
 Intel Paragon                    16               3435.
                                  32               2202.
                                  64               1257.
---------------------------------------------------------------------
 Sun SPARCcenter                   8               2382.
 2000 (40 MHz)                    16               1617.
---------------------------------------------------------------------

This benchmark was not run on the SGI PowerChallenge or the DEC ALPHA cluster. The SP benchmark contains several transposes that require a faster communication fabric.

** C90 timings were obtained by taking the same Fortran 77 code that was input to xHPF and spf for the other timings and compiling it with cf77, with no special optimization switches. Though these results are below the speeds a C90 is capable of, the purpose here is to show simple compile-and-run performance for a portable code on parallel systems versus the C90, the widely recognized single-processor supercomputing standard. The C90 results are indicative of sequential scalar performance on the C90, since Cray's cf77 did not vectorize some of the major loops in the benchmarks for one reason or another. For example, the EP benchmark calls a subroutine from within its major loop and, because cf77 does not provide global interprocedural analysis (as xHPF does), it was unable to vectorize the main loop. It is interesting that xHPF can parallelize many loops that the best vectorizing compilers in the industry cannot vectorize. As with the parallel machines, no attempt was made to make the C90 timings the best that could be achieved. The objective was not to hit "Macho-flop" performance ratings, but rather to indicate what performance the normal compile-and-run user might expect.
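To make the EP observation concrete, here is an invented fragment, not the actual NAS EP source, with the structure just described: the main loop's body is a subroutine call, so a vectorizer that analyzes one routine at a time cannot prove the call harmless and leaves the loop scalar, even though the iterations are independent and can be spread across processors by a tool that analyzes across the call boundary.

      PROGRAM EPLIKE
C     Invented illustration only -- not the NAS EP benchmark.  The
C     main loop calls a subroutine; without interprocedural analysis
C     a vectorizing compiler must assume the call may carry a
C     dependence and so leaves the loop scalar.  The iterations are
C     actually independent, so they can be distributed over the
C     processors once that independence is established.
      INTEGER N
      PARAMETER (N = 100000)
      REAL RES(N), TOTAL
CHPF$ DISTRIBUTE RES(BLOCK)
      INTEGER I

      DO 10 I = 1, N
        CALL KERNEL(I, RES(I))
   10 CONTINUE

      TOTAL = 0.0
      DO 20 I = 1, N
        TOTAL = TOTAL + RES(I)
   20 CONTINUE
      PRINT *, 'TOTAL = ', TOTAL
      END

      SUBROUTINE KERNEL(I, R)
C     Purely local work on the arguments; nothing here actually
C     prevents parallel execution of the calling loop.
      INTEGER I
      REAL R
      R = SQRT(REAL(I))*SIN(REAL(I))
      END

Whether any particular compiler vectorizes or parallelizes such a loop of course depends on that compiler; the fragment is only meant to show the shape of the problem referred to above.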
Benchmark EP: Embarrassingly Parallel Benchmark
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                          1                694.
---------------------------------------------------------------------
 Cray T3D                         16                100.
                                  32                 50.
                                  64                 25.
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                 79.
                                  32                 40.
                                  64                 23.
---------------------------------------------------------------------
 Intel Paragon                    16                251.
                                  32                126.
                                  64                 64.
---------------------------------------------------------------------
 DEC ALPHA                         4                261.
 3000/900 (275 MHz)                8                131.
---------------------------------------------------------------------
 SGI PowerChallenge                4                459.
 MIPS R8000                        8                233.
                                  16                116.
---------------------------------------------------------------------

Benchmark BT: Simulated CFD Application
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                          1              10615.
---------------------------------------------------------------------
 Cray T3D                         16               1958.
                                  32               1044.
                                  64                551.
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                446.
                                  32                245.
                                  64                164.
---------------------------------------------------------------------
 Intel Paragon                    16               5741.
                                  32               3091.
                                  64               1809.
---------------------------------------------------------------------
 Sun SPARCcenter                   8               3393.
 2000 (40 MHz)                    16               1759.
---------------------------------------------------------------------

This benchmark was not run on the SGI PowerChallenge or the DEC ALPHA cluster. The BT benchmark contains several transposes that require a faster communication fabric.

Benchmark FT
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                 Code requires more memory than available
---------------------------------------------------------------------
 Intel Paragon                    16               1165.
                                  32                249.7
                                  64                247.7
---------------------------------------------------------------------
 Cray T3D                         16                279.4
                                  32                192.41
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                104.8
                                  32                 67.52
---------------------------------------------------------------------

This benchmark was not run on the SGI PowerChallenge or the DEC ALPHA cluster. The FT benchmark contains several transposes that require a faster communication fabric.

Benchmark MG
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                 Code requires more memory than available
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                 17.48
                                  32                 12.25
                                  64                 12.07
---------------------------------------------------------------------

Source code for these benchmarks can be found via anonymous FTP to ftp.infomall.org in subdirectory /tenants/apri/Bench or via WWW at the URL http://www.infomall.org/apri/

Printed hardcopy of this report can be obtained by contacting:

    Applied Parallel Research, Inc.
    1723 Professional Drive
    Sacramento, CA 95825

    Voice:  (916) 481-9891
    FAX:    (916) 481-7924
    E-mail: support@apri.com
    APR Web Page: http://www.infomall.org/apri/

-------------------------------------------------------------------------------