Newsgroups: comp.lang.fortran,comp.parallel,comp.sys.super
From: marc@efn.org (Marc Baber)
Subject: APR xHPF 2.1 Released; NAS Parallel Benchmark Results
Date: Wed, 5 Jul 1995 20:34:09 GMT

   APR RELEASES xHPF 2.1, WORLD'S FIRST HPF TO TURN IN NAS BENCHMARK RESULTS

                                 July 5, 1995
=========================================================================

Sacramento, CA -- Applied Parallel Research (APR) announced it will begin shipping xHPF 2.1, the latest version of its industry-leading High Performance Fortran (HPF) compilation system. xHPF 2.1 is the first HPF implementation to compile NAS Parallel Benchmark (NPB) programs and has thus set a new standard for end-user-achievable performance on a wide range of parallel platforms.

The NPB suite is used to measure sustainable performance of computer systems when running five computational kernels and three simulated CFD programs. The programs represent typical applications used in NASA's NAS project. The benchmarks are considered a "pathfinder" in searching out the best parallel systems for grand challenge problems such as modeling whole aircraft.

APR's president, John Levesque, said, "xHPF may be the only HPF implementation capable of successfully parallelizing NAS parallel benchmarks today. To date, no other HPF vendor has published even a single result for any of these eight benchmarks. The speed-ups achieved by xHPF on the [Cray] T3D, the [IBM] SP-2, and the [Intel] Paragon are impressive enough that I believe it will be months or even years before other HPF vendors can offer comparable performance."

"We expect 1995 will be the watershed year for parallel programming of distributed memory systems and clusters. Before 1995, hand-tuned message-passing programming was the norm. Beginning this year, automatic parallelization by sophisticated, production-quality HPF compilers will be the norm, and APR's xHPF is well-positioned to become the de facto industry standard for HPF compilation. This is a wake-up call for application programmers who've been waiting since the early days of the hypercubes for good parallel Fortran compilers."

From UNI-C, The Danish Computing Center for Research and Education, Jorgen Moth commented, "Parallelization of standard Fortran programs is made practical for our busy scientists by FORGE Explorer and xHPF. We have found these tools to be a bridge between Fortran 77, Fortran 90, and HPF, thus removing many obstacles from the exploitation of parallel machines."

At the Cornell Theory Center, where the largest IBM SP-2 (512 nodes) is installed, Donna Bergmark summarized over two years of experience with xHPF, saying, "It [APR's xHPF] has proven to be a 'quick and easy' way to get a program to run in parallel, without having to learn a message-passing protocol." She also noted, "At the present time, there are on average 500-800 invocations of xHPF per month [at the CTC]."

With xHPF, automatic parallelization has now reached the point where gains achievable by hand-parallelization are often not cost-effective when the expense of re-programming is factored into the price-performance equation. Nonetheless, for users who demand the very highest performance, APR provides ForgeX, a Motif GUI Fortran code browser and interactive parallelization system that is fully compatible with xHPF. With ForgeX, users can interactively fine-tune their parallelized codes, using their knowledge of the underlying algorithms as well as execution timings obtained with ForgeX's code instrumentation features.
To underscore the importance of the latest release, during the month of July APR is offering free ForgeX licenses, including interactive parallelization for distributed memory systems, to sites purchasing xHPF licenses. The number of concurrent interactive users is tied to the number of processors the xHPF-parallelized codes will be run on. Contact APR for details.

The NAS Benchmark results include the EP, SP, BT, FT, and MG programs. These are slightly modified versions of the standard Fortran-77 programs from NASA, supplemented with HPF directives. While many MPP vendors spent months optimizing the sequential versions of these programs to use cache more effectively, or to perform table lookups for some operations, no similar restructurings were performed on APR's versions. The APR versions of the NAS benchmarks are therefore closer to end-user programs, and the results obtained should be more representative of what the general user community can expect.

The timings in the following tables were obtained using xHPF and APR's shared-memory parallelization system, spf. With these results APR is demonstrating the ability to maintain portable code across varied MPP and SMP parallel systems. All of the benchmarks also run sequentially on a uniprocessor.

The results in the tables following this article are for xHPF 2.1 (APR development version 2029) and, for shared-memory systems, spf. As development versions and new releases of xHPF achieve even better results, APR will update the timings available on its web pages at http://www.infomall.org/apri. APR encourages other HPF vendors to respond in kind by making their HPF benchmark results available on their web pages, accessible from the HPFF (High Performance Fortran Forum) web page at http://www.erc.msstate.edu/hpff/home.html.

The speedups obtained for the Fortran-77 versions of the benchmarks highlight xHPF's ability to parallelize DO loops in addition to Fortran-90 array syntax and HPF FORALL statements (a sketch of this style of annotated source appears below). Some other HPF implementations either do not attempt to parallelize DO loops or lack xHPF's robust dependence analysis, and so fail to parallelize DO loops that pose no problem for xHPF.

These Fortran programs were processed without modification by APR's xHPF, and code was generated for the Cray T3D, IBM SP2, Digital ALPHA cluster, SGI Power Challenge cluster, and Intel PARAGON. Some of the benchmarks were also processed, again without modification, by APR's spf, with code generated for the Sun SPARCcenter 2000.

The IBM SP2 does very well compared to the other MPP systems. APR's timings do come closer to the timings supplied by IBM, but no special "tuning" was done for the SP-2. The superior performance can be attributed to IBM's Fortran-77 compiler (xlf), which compiles the parallelized SPMD Fortran-77 code output by xHPF; xlf is more successful at achieving maximum single-processor performance than the other vendors' Fortran-77 compilers. The scaling of the timings as the number of processors is increased is good on all the platforms.

APR is a leading supplier of software tools for Fortran program analysis, performance measurement, parallelization, restructuring, and dialect translation.
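The release itself does not reproduce any of the annotated benchmark sources. As background only, the following is a minimal, hypothetical sketch of the style it describes; the program, array names, sizes, and BLOCK distribution are invented for illustration. The CHPF$ directives are ordinary comments to a plain Fortran 77 compiler, while an HPF system such as xHPF uses them to distribute the data and generate parallel code.

      PROGRAM RELAX
C     Hypothetical fragment (not from the NAS benchmark sources):
C     a standard Fortran 77 relaxation sweep annotated with HPF
C     data-distribution directives.
      INTEGER N
      PARAMETER (N = 256)
      REAL U(N,N), V(N,N)
CHPF$ DISTRIBUTE U(*,BLOCK)
CHPF$ ALIGN V(I,J) WITH U(I,J)
      INTEGER I, J

C     Initialize the grid.
      DO 5 J = 1, N
        DO 5 I = 1, N
          U(I,J) = REAL(I + J)
    5 CONTINUE

C     A plain DO loop nest, the case emphasized above: there is no
C     parallel syntax here, so dependence analysis must prove the
C     iterations independent before they can be distributed.
      DO 10 J = 2, N - 1
        DO 10 I = 2, N - 1
          V(I,J) = 0.25*(U(I-1,J) + U(I+1,J) + U(I,J-1) + U(I,J+1))
   10 CONTINUE

C     The same update written as an HPF FORALL, which is parallel
C     by construction.
      FORALL (J = 2:N-1, I = 2:N-1)
     &  V(I,J) = 0.25*(U(I-1,J) + U(I+1,J) + U(I,J-1) + U(I,J+1))

      PRINT *, 'V(2,2) = ', V(2,2)
      END

How a given HPF implementation handles the plain DO loop nest, as opposed to the FORALL, is exactly the distinction the preceding paragraph draws.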
======== NAS Benchmark Results ========

NOTE: The following times are between 2 and 10 times slower than the timings reported by the various vendors. The major difference is due to the vendors' extensive rewriting of the benchmarks to obtain the best possible single-node performance. APR has asked, and will continue to ask, the vendors to supply their optimized single-node versions of the benchmarks so everyone can start with the same sequential programs. To date, however, all vendors have refused, saying their versions of the benchmarks are proprietary.

Benchmark SP: Simulated CFD Application
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                          1               7634.  **
---------------------------------------------------------------------
 Cray T3D                         16               2368.
                                  32               1353.
                                  64                728.
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                576.
                                  32                320.
                                  64                192.
---------------------------------------------------------------------
 Intel Paragon                    16               3435.
                                  32               2202.
                                  64               1257.
---------------------------------------------------------------------
 Sun SPARCcenter                   8               2382.
 2000 (40 MHz)                    16               1617.
---------------------------------------------------------------------

This benchmark was not run on the SGI PowerChallenge or the DEC ALPHA cluster. The SP benchmark contains several transposes that require a faster communication fabric.

** C90 timings were obtained by taking the same Fortran 77 code that was input to xHPF and spf for the other timings and compiling it with cf77, with no special optimization switches. Though these results are below the speeds a C90 is capable of, the purpose here is to show simple compile-and-run performance for a portable code on parallel systems versus the C90, the widely recognized single-processor supercomputing standard. The C90 results are indicative of sequential scalar performance on the C90, since Cray's cf77 did not vectorize some of the major loops in the benchmarks for one reason or another. For example, the EP benchmark calls a subroutine from within its major loop and, because cf77 does not provide global interprocedural analysis (as xHPF does), it was unable to vectorize the main loop. It is interesting that xHPF can parallelize many loops that the best vectorizing compilers in the industry cannot vectorize. As with the parallel machines, no attempt was made to make the C90 timings the best that could be achieved. The objective was not to hit "Macho-flop" performance ratings, but rather to indicate what performance the normal compile-and-run user might expect.
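To make the EP observation concrete, here is an invented fragment, not the actual NAS EP source, with the structure just described: the main loop's body is a subroutine call, so a vectorizer that analyzes one routine at a time cannot prove the call harmless and leaves the loop scalar, even though the iterations are independent and can be spread across processors by a tool that analyzes across the call boundary.

      PROGRAM EPLIKE
C     Invented illustration only -- not the NAS EP benchmark.  The
C     main loop calls a subroutine; without interprocedural analysis
C     a vectorizing compiler must assume the call may carry a
C     dependence and so leaves the loop scalar.  The iterations are
C     actually independent, so they can be distributed over the
C     processors once that independence is established.
      INTEGER N
      PARAMETER (N = 100000)
      REAL RES(N), TOTAL
CHPF$ DISTRIBUTE RES(BLOCK)
      INTEGER I

      DO 10 I = 1, N
        CALL KERNEL(I, RES(I))
   10 CONTINUE

      TOTAL = 0.0
      DO 20 I = 1, N
        TOTAL = TOTAL + RES(I)
   20 CONTINUE
      PRINT *, 'TOTAL = ', TOTAL
      END

      SUBROUTINE KERNEL(I, R)
C     Purely local work on the arguments; nothing here actually
C     prevents parallel execution of the calling loop.
      INTEGER I
      REAL R
      R = SQRT(REAL(I))*SIN(REAL(I))
      END

Whether any particular compiler vectorizes or parallelizes such a loop of course depends on that compiler; the fragment is only meant to show the shape of the problem referred to above.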
Benchmark EP: Embarrassingly Parallel Benchmark
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                          1                694.
---------------------------------------------------------------------
 Cray T3D                         16                100.
                                  32                 50.
                                  64                 25.
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                 79.
                                  32                 40.
                                  64                 23.
---------------------------------------------------------------------
 Intel Paragon                    16                251.
                                  32                126.
                                  64                 64.
---------------------------------------------------------------------
 DEC ALPHA                         4                261.
 3000/900 (275 MHz)                8                131.
---------------------------------------------------------------------
 SGI PowerChallenge                4                459.
 MIPS R8000                        8                233.
                                  16                116.
---------------------------------------------------------------------

Benchmark BT: Simulated CFD Application
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                          1              10615.
---------------------------------------------------------------------
 Cray T3D                         16               1958.
                                  32               1044.
                                  64                551.
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                446.
                                  32                245.
                                  64                164.
---------------------------------------------------------------------
 Intel Paragon                    16               5741.
                                  32               3091.
                                  64               1809.
---------------------------------------------------------------------
 Sun SPARCcenter                   8               3393.
 2000 (40 MHz)                    16               1759.
---------------------------------------------------------------------

This benchmark was not run on the SGI PowerChallenge or the DEC ALPHA cluster. The BT benchmark contains several transposes that require a faster communication fabric.

Benchmark FT
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                 Code requires more memory than available
---------------------------------------------------------------------
 Intel Paragon                    16               1165.
                                  32                249.7
                                  64                247.7
---------------------------------------------------------------------
 Cray T3D                         16                279.4
                                  32                192.41
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                104.8
                                  32                 67.52
---------------------------------------------------------------------

This benchmark was not run on the SGI PowerChallenge or the DEC ALPHA cluster. The FT benchmark contains several transposes that require a faster communication fabric.

Benchmark MG
---------------------------------------------------------------------
 Platform                     Processors        Time (sec.)
---------------------------------------------------------------------
 Cray C90                 Code requires more memory than available
---------------------------------------------------------------------
 IBM SP2-WIDE                     16                 17.48
                                  32                 12.25
                                  64                 12.07
---------------------------------------------------------------------

Source code for these benchmarks can be found via anonymous FTP to ftp.infomall.org in subdirectory /tenants/apri/Bench or via WWW at the URL http://www.infomall.org/apri/

Printed hardcopy of this report can be obtained by contacting:

    Applied Parallel Research, Inc.
    1723 Professional Drive
    Sacramento, CA 95825

    Voice:  (916) 481-9891
    FAX:    (916) 481-7924
    E-mail: support@apri.com
    APR Web Page: http://www.infomall.org/apri/

-------------------------------------------------------------------------------