--------------------------------------------------------------------------
        Applied Parallel Research FORGE 90 DMP/SMP DataSheet
--------------------------------------------------------------------------

              FORGE Interactive Parallelization Tools
        For Distributed and Shared Memory Multiprocessor Systems
              and Clusters of Networked Workstations

      An interactive Fortran parallelization environment from APR
--------------------------------------------------------------------------

* Baseline FORGE Browser
------------------------

APR's interactive parallelizers for distributed and shared memory systems
are built upon the industry's leading interprocedural Fortran program
browser, FORGE Baseline.  This is the only tool powerful enough to
analyze large, complex Fortran application programs for parallelization
on both shared and distributed memory systems.

The Baseline Browser utilizes an innovative database capable of analyzing
even the most convoluted "dusty deck" program.  FORGE's database viewing
tools provide facilities for fast reference tracing of variables and
constants, consistency checking of COMMON blocks and subprogram calls,
and exposing variable aliasing through COMMON and calls, as well as
displaying COMMON block usage, data flow through calls, and data
dependencies between routines and basic or arbitrary code blocks.

Baseline FORGE's unique interprocedural database provides the complete,
global view of a program that you need before you start optimizing.
Additional facilities for program maintenance and tidy reformatting, and
an advanced instrumentation module and runtime library for gathering
serial execution performance statistics, are also included.

* The Distributed Memory Parallelizer (DMP)
-------------------------------------------

Add the Distributed Memory Parallelizer onto Baseline FORGE to
interactively spread loops and distribute data arrays for MIMD
architectures.  The parallelized program is fully scalable, with calls to
APR's parallel runtime library, which interfaces with any of the popular
communication packages (PVM, Express, Linda) or native message-passing
systems.  With FORGE's SPMD (Single Program, Multiple Data)
parallelization strategy, the same program runs on each processor while
selected DO loops are rewritten to automatically distribute their
iterations across the processors.

Your first step in parallelizing a Fortran program is identifying the
critical data arrays, proposing an array decomposition scheme, and then
restructuring the program to decompose these arrays over the processors.
FORGE DMP's Data Decomposition facility offers you an interactive way to
specify decompositions and select arrays for partitioning while viewing
the implications of these decisions.  The Data Decomposer implements
either BLOCK or CYCLIC distributions along any single dimension, with
either FULL or SHRUNK memory allocation.  With full allocation, an array
is allocated its original size on each processor.  With shrunk
allocation, each processor is allocated only enough memory for an array
to hold the elements that it owns.  (The first sketch below makes this
arithmetic concrete.)

The next step is identifying which loops to parallelize.  DMP's Loop
Spreader allows interactive or automatic selection of loops.  Under
automatic selection, DMP uses actual runtime execution statistics to
determine which loops are best parallelized for high granularity and low
communication cost.  (The second sketch below shows what a spread loop
amounts to.)
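To make the decomposition arithmetic concrete, here is a minimal sketch
of which elements each processor owns under BLOCK and CYCLIC
distributions of a small 1-D array, and the resulting SHRUNK local
allocation size.  This is illustrative Fortran only; it shows the
arithmetic, not APR's actual runtime library interface.

      PROGRAM DECOMP
C     Illustrative only: for each of NPROC processors, list the global
C     indices it owns under BLOCK and CYCLIC distributions of A(1:N),
C     and the SHRUNK local size (vs. N words under FULL allocation).
      INTEGER N, NPROC, P, BSIZE, LO, HI, NLOC, I
      PARAMETER (N = 10, NPROC = 4)

      BSIZE = (N + NPROC - 1) / NPROC
      DO 10 P = 0, NPROC - 1
C        BLOCK: processor P owns one contiguous chunk of BSIZE elements
         LO = P * BSIZE + 1
         HI = MIN(N, (P + 1) * BSIZE)
C        MAX guards the case of an empty trailing chunk
         NLOC = MAX(0, HI - LO + 1)
         WRITE (*,*) 'BLOCK  proc', P, ': owns', LO, '..', HI,
     &               '  shrunk size =', NLOC
   10 CONTINUE

      DO 20 P = 0, NPROC - 1
C        CYCLIC: processor P owns elements P+1, P+1+NPROC, P+1+2*NPROC, ...
         WRITE (*,*) 'CYCLIC proc', P, ': owns',
     &               (I, I = P + 1, N, NPROC)
   20 CONTINUE
      END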
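And a sketch of what a spread loop amounts to under a BLOCK distribution
with FULL allocation.  In DMP's actual output the node number and loop
bounds come from calls into APR's parallel runtime library; here a
serial loop over a hypothetical MYNODE variable simulates the NPROC
copies of the SPMD program, so the fragment runs anywhere.

      PROGRAM SPREAD
C     Sketch of SPMD loop spreading.  Originally:  DO 10 I = 1, N.
C     After spreading, every node runs the same program but executes
C     only the iterations it owns.  MYNODE stands in for the node
C     number normally supplied by the parallel runtime.
      INTEGER N, NPROC
      PARAMETER (N = 8, NPROC = 2)
      REAL A(N), B(N)
      INTEGER MYNODE, BSIZE, LO, HI, I

      DO 5 I = 1, N
         B(I) = REAL(I)
    5 CONTINUE

      BSIZE = (N + NPROC - 1) / NPROC
      DO 20 MYNODE = 0, NPROC - 1
C        Rewritten loop bounds: node MYNODE touches only its own chunk.
C        Under FULL allocation A is still dimensioned N on every node.
         LO = MYNODE * BSIZE + 1
         HI = MIN(N, LO + BSIZE - 1)
         DO 10 I = LO, HI
            A(I) = 2.0 * B(I)
   10    CONTINUE
   20 CONTINUE

      WRITE (*,*) 'A =', (A(I), I = 1, N)
      END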
DMP checks for parallelization inhibitors, rewrites dimension
declarations and array subscripts on distributed arrays to reflect the
partitioning, modifies DO loop control counters to operate dynamically
depending on the loop distribution scheme, and ensures that all
restructurings are consistent through subroutine calls.  Data
communication calls to APR's parallel runtime library are inserted
automatically around and within distributed loops to move the data as it
is needed.  DMP's interactive displays allow fine-tuning of the
communications.  The resulting parallel program is dynamically scalable
at runtime.

DMP can also be used to interactively view the parallelizations developed
by APR's batch parallelizing pre-compilers dpf and xhpf.

* Parallel Performance Profiler and Simulator
---------------------------------------------

Programs parallelized by FORGE DMP can utilize APR's Parallel Profiler to
gather runtime performance statistics of CPU utilization and
communication costs.  The Performance Simulator can be used to predict
performance on various MPP systems or configurations.

Performance instrumentation options generate parallelized programs with
calls to APR's runtime timing library to accumulate data on each node for
loop and subprogram execution times, communication costs, and program
wait times.  The post-processor polytime is provided to analyze the
results over all nodes and produce a composite report of a program's true
performance on the parallel system.

By linking with APR's runtime simulation library, the performance of a
parallelized program running on a single node can be extrapolated to
report CPU and communication performance on a variety of scalable MPP
systems.

* The Shared Memory Parallelizer (SMP)
--------------------------------------

Another add-on to Baseline FORGE is the Shared Memory Parallelizer.
Unlike parallelizing compilers, which often fail on the most important DO
loops in a program, SMP's interprocedural analysis can handle loops that
call subroutines.  SMP's strategy is to parallelize for high granularity
by analyzing outermost loops first.  It analyzes array and scalar
dependencies across subprogram boundaries by tracing references through
the database up and down the call tree.  The result is a parallelized
source code with compiler-specific directives inserted for scoping
variables and for identifying Critical and Ordered regions of code.

DO loops are selected for parallelization interactively.  Using execution
performance timings as a guide, FORGE SMP will suggest the most
significant loop as a starting point, working through the code from the
most CPU-intensive loops down to a threshold below which parallelization
does not produce a performance gain.

SMP's interprocedural analysis makes it possible to scope variables
passed through subprogram calls and COMMON.  In a parallel region of
code, SMP analyzes all variable references within a loop, including those
enclosed in routines called from the loop.  Proceeding down the call
chain, SMP identifies variables as PRIVATE or SHARED, and GLOBAL or
LOCAL, displaying them interactively and allowing you to modify its
decisions.  SMP also identifies Critical and Ordered regions in the code
that will give rise to synchronization calls; on some systems these
regions cannot be parallelized.  These, too, are displayed interactively.

Following successful analysis of a loop nest, FORGE SMP inserts
directives specific to the target system and compiler on which the
program is to be run, as in the example below.
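As an illustration, the fragment below shows the kind of scoped directive
SMP produces for one loop.  The C$DOACROSS dialect is SGI-style Power
Fortran syntax, used here only as one example target; the SHARE/LOCAL
lists correspond to SMP's SHARED/PRIVATE scoping analysis, and other
targets use other directive forms.  Compilers that do not recognize the
directive simply treat it as a comment.

      PROGRAM SMPDIR
C     Illustration of directive output in the style FORGE SMP produces
C     for one target (SGI-style C$DOACROSS shown).
      INTEGER N, I
      PARAMETER (N = 100)
      REAL A(N), B(N), T

      DO 5 I = 1, N
         B(I) = REAL(I)
    5 CONTINUE

C     A and B are referenced at distinct indices per iteration, so they
C     are scoped SHARED; the index I and the temporary T must be LOCAL
C     (private) to each iteration.
C$DOACROSS SHARE(A, B, N), LOCAL(I, T)
      DO 10 I = 2, N - 1
         T = B(I-1) + B(I+1)
         A(I) = 0.5 * T
   10 CONTINUE

      WRITE (*,*) 'A(2) =', A(2), '  A(N-1) =', A(N-1)
      END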
SMP knows about a number of shared memory systems and generates the
directives appropriate to each.  Your program's parallel analysis is
saved and can be recalled later to generate a parallel program for
another target system.

------------------
Other APR Products
------------------

forgex   FORGE Explorer, a Motif GUI global Fortran program browser

APR offers three MAGIC Parallelizing Batch Pre-Compilers:

  dpf    for distributed memory systems
  spf    for shared memory systems
  xhpf   for HPF directives and Fortran 90 array syntax on distributed
         memory systems

---------------------
Platforms and Targets
---------------------

APR's products are available to run on various systems including HP,
SUN, IBM RS/6000, DEC Alpha, and Cray.

Parallelizations and runtime support are available for: workstation
clusters, IBM SP1 and POWER/4, Intel Paragon, nCUBE, Meiko, Cray T3D,
and CM-5.

----------------
More Information
----------------

For further information on these tools and our parallelization
techniques training workshops, contact us at:

  Applied Parallel Research, Inc.
  550 Main Street, Suite I
  Placerville, CA 95667
  Phone: 916/621-1600
  Fax:   916/621-0593
  email: forge@netcom.com

Copyright (c) 1993 Applied Parallel Research, Inc.  11/93