PTLIB Recommended Reading
Parallel debugging and performance analysis
Books
- ed. Margaret Simmons, Ann Hayes, Jeffrey Brown, and Daniel Reed.
Debugging and Performance Tuning for Parallel Computing Systems.
IEEE Computer Society, 1996.
Papers
- R. Wismuller, M. Oberhuber, J. Krammer, and O. Hansen.
Interactive debugging and performance analysis of massively parallel
applications. Parallel Computing 22(3), March 1996, pp. 415-442.
- This paper describes a tool environment for the debugging and performance
tuning phase of parallel program development. The environment includes
a parallel debugger (DETOP), a performance analyzer (PATOP), and a common
distributed monitoring system, and is available for Parsytec's
PowerPC-based parallel computing systems.
- John May and Francine Berman.
Creating Views for Debugging Parallel Programs.
Proc. SHPCC'94, Knoxville, TN, May 1994.
- This paper describes the features of the Panorama parallel debugger
that allow programmers to implement new graphical views of a program's
state and augment existing ones. The user part of Panorama runs
on a sequential computer and gathers information from the vendor-supplied
debugger running on the parallel machine. New views are written in Tcl/Tk
code, some of which is generated automatically by the Panorama Display
Builder tool. While the debugger is running, a view can be updated manually,
when the user pushes a button, or automatically, through callbacks that have
been registered with the Panorama run server.
- Doreen Cheng and Robert Hood.
A Portable Debugger for Parallel and Distributed Programs.
Proc. Supercomputing '94.
- The goal of the p2d2 project at NASA Ames is to build a debugger that
supports debugging parallel and distributed computations at the source
code level and that is portable across a variety of platforms and
programming models. The strategy of the p2d2 design is to use a
client-server model to partition the debugger. The debugger server
contains the platform-dependent code while the debugger client contains
the portable user-interface code. It is intended that the server be
implemented by the platform vendor and the client by the vendor or a
user community or other third party. In the prototype p2d2 implementation,
the debugger server is based on gdb. The p2d2 project has defined the
following three interaction protocols:
- the debugger server (Ds) protocol that specifies how the user
interface client can access debugging abstractions provided by
the server (e.g., setting a BreakpointTrap by registering a
callback function to be executed when a specific code location
is reached)
- the communication library protocol for interaction between
a message passing library such as PVM and the debugger.
The protocol specifies a collection of debugger library entry
points and requires the communication library to register
run-time events.
- the distribution preprocessor protocol which gives the debugger
access to information about how data structures in an HPF program
are distributed and aligned
For debugging in a heterogeneous distributed computing environment,
the distribution manager component, which executes on the client
machine along with the user interface component, handles the mapping
of user requests into operations provided by a remote set of
debugger servers and handles the aggregation of responses.
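The callback-based breakpoint abstraction of the Ds protocol can be pictured with a small C sketch. The names below (ds_event, ds_set_breakpoint, and so on) are hypothetical and are not the actual p2d2 interface; the stub merely simulates an immediate breakpoint hit so the sketch runs.

```c
/* Hypothetical callback-style debugger-server interface in the spirit of
 * the Ds protocol; these names are illustrative, not the real p2d2 API. */
#include <stdio.h>

typedef struct {
    const char *file;    /* source location of the event   */
    int         line;
    int         process; /* id of the process that stopped */
} ds_event;

/* The client registers a callback; the server runs it when the location is hit. */
typedef void (*ds_callback)(const ds_event *ev, void *user_data);

static void on_hit(const ds_event *ev, void *user_data)
{
    (void)user_data;
    printf("process %d stopped at %s:%d\n", ev->process, ev->file, ev->line);
}

/* Stub server: a real implementation would plant a BreakpointTrap and fire
 * the callback when execution reaches file:line; here it fires immediately. */
static int ds_set_breakpoint(const char *file, int line,
                             ds_callback cb, void *user_data)
{
    ds_event ev = { file, line, 0 };
    cb(&ev, user_data);
    return 0;
}

int main(void)
{
    ds_set_breakpoint("solver.c", 120, on_hit, NULL);
    return 0;
}
```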
- D. Abramson and R. Sosic. A Debugging Tool for Software Evolution.
CASE-95, 7th International Workshop on Computer-Aided Software Engineering,
Toronto, Ontario, Canada, July 1995.
- GUARD (Griffith University Parallel Debugger) is a debugging tool which
addresses debugging problems in the context of evolving software from
existing programs or of porting software from one system to another,
for example porting a vector code to a parallel computer.
Guard facilitates comparison of a new program with an existing reference
version which is assumed to produce correct results. Guard requires
the user to specify key points in the two programs at which various
data structures should be equivalent within a predefined tolerance.
Errors in the new program are then located automatically by runtime
comparison of the internal state of the new and reference versions.
Guard makes no assumptions about the control flow of the two programs,
which may be different. The programs may also be written in different
languages. Guard is aimed at scientific and numerical computing and
currently handles variables of base data types integer,
real, character, and multidimensional arrays of these base types.
A visualization system uses pixel maps to show graphically when corresponding
values of arrays are not the same, thus allowing visualization of
large data sets present in numeric programs.
Guard works in a distributed, heterogeneous environment, and it is
possible to execute the reference code, the program being debugged, and
Guard on three different computer systems connected by a network.
Current work includes extending Guard to support conversion of
existing sequential applications into data parallel language form.
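The check that Guard automates, that corresponding data structures agree within a predefined tolerance, can be sketched in C. This is an illustration of the idea only, not Guard's actual assertion mechanism.

```c
/* Illustrative tolerance comparison of a data structure in a reference run
 * against the same structure in a ported/modified run (not Guard syntax). */
#include <math.h>
#include <stdio.h>

static int count_mismatches(const double *reference, const double *candidate,
                            int n, double tolerance)
{
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (fabs(reference[i] - candidate[i]) > tolerance)
            mismatches++;
    return mismatches;
}

int main(void)
{
    double ref[4]  = { 1.0, 2.0, 3.0, 4.0 };       /* reference version's values */
    double port[4] = { 1.0, 2.0000001, 3.1, 4.0 }; /* new version's values       */
    printf("%d element(s) outside tolerance\n",
           count_mismatches(ref, port, 4, 1e-6));
    return 0;
}
```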
-
Kranzmuller, Grabner, and Volkert.
Event Graph Visualization for Debugging Large Applications.
Proc. SPDT'96: ACM SIGMETRICS Symposium on Parallel and Distributed Tools,
May 1996, Philadelphia, pp. 108--117.
- This paper describes ATEMPT (A Tool for Event ManiPulaTion) which is
part of the MAD (Monitoring And Debugging) environment for
debugging and performance analysis of message passing programs on
distributed memory machines. ATEMPT provides error detection facilities
for quickly detecting communication errors such as isolated send and
receive events and corresponding message events with different message
lengths. ATEMPT also includes functionality for detecting race conditions,
for reordering race condition events, and for simulating the modified
event graph. Another feature of ATEMPT is the generation of a "cut",
which is a set of breakpoints on more than one process. ATEMPT includes
a source code connection for relating a graphically displayed event symbol
to the line of code which generated the event. For handling large
applications, ATEMPT provides horizontal (time-axis) abstraction -- e.g.,
representation of a communication pattern with one symbol -- and vertical
(process-axis) abstraction -- e.g., grouping of processes.
- Vikram S. Adve, John Mellor-Crummey, Mark Anderson, Ken Kennedy,
Jhy-Chun Wang, and Daniel A. Reed.
An Integrated Compilation and Performance Analysis Environment for
Data Parallel Programs.
Proc. Supercomputing '95.
- This paper describes work on integrating data-parallel Fortran compilers
(Fortran 77D and HPF) with the Illinois Pablo performance analysis
environment. The Fortran 77D compiler translates a Fortran 77 program
annotated with data layout directives into an explicitly parallel SPMD
message-passing code. The compiler also performs complex optimizing
transformations. In the integrated system, the FortranD compiler decides
where instrumentation should be inserted in the synthesized message-passing
code and inserts calls to the appropriate Pablo routines.
The compiler also records static mapping information about how source
code constructs relate to one another and to corresponding constructs
in the compiler-synthesized code. Dynamic records generated during an
instrumented procedure call include the tag of a static data correlation
record. The FortranD editor graphically presents the measured computation
and communication costs of individual procedures and loops in the source
code program, as well as showing dependences that cause communication.
Information about remote data access patterns can also be collected,
although dynamically recording the bounds of the array sections communicated
in each message is expensive; the authors are investigating how to reduce
this overhead. The dPablo browser is an off-line
performance tool that displays performance metrics and array reference
patterns computed over user-specified code regions and execution intervals.
The data correlation software and dPablo interface have been ported
to the Portland Group HPF compiler.
- Margaret R. Martonosi, David Ofelt, and Mark Heinrich.
Integrating Performance Monitoring and Communication in Parallel Computers.
ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems,
May 1996.
- Because of significant relative increases in memory latencies compared
to processor speeds, memory system behavior is of great importance
to application performance. The authors argue that hardware support
for memory performance monitoring is needed to assess program behavior
accurately and efficiently. They observe that mechanisms used to
implement cache coherence are similar to what is desired for performance
monitoring. Current hardware monitors (e.g., hardware miss counters)
are efficient but are difficult to trigger selectively and support little
beyond aggregate counting. Software-based monitors
can categorize miss counts according to the code and data structures
that caused them but have considerably higher overhead.
Within many cache-coherent shared-memory multiprocessors, the three
components needed for performance monitoring -- trigger points,
handlers, and state storage -- have already been implemented
for the purpose of cache coherence support. Furthermore, memory
events that cause major delays are almost always those requiring
coherence actions. As an example of integrating performance monitoring
with coherence support, the authors have implemented Flashpoint,
a performance monitoring tool for the FLASH multiprocessor.
Flashpoint is implemented as a modified cache coherence protocol.
Flashpoint maintains per-data-structure, per-procedure statistics.
- William Gropp.
An Introduction to Performance Debugging for Parallel Computers.
Technical report MCS-P500-0295, Argonne National Laboratory, April 1995.
- This paper describes how to analyze a distributed-memory message-passing
program for performance and discusses techniques for fixing performance
problems. The dominant cost for most computations is the number and
kind of memory references. A scalability analysis, which is an analytic
estimate of expected performance as a function of the number of processes,
is used to estimate the performance of a parallel code. Then tools
are used to observe the per-node and parallel performance, in order
to identify the parts of the code that are under-performing.
Code should be tuned first for per-node performance and then for
message-passing or remote memory performance.
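As an illustration of the kind of scalability analysis the paper advocates (a generic model stated here for illustration, not one taken from the report), the time to sweep an n x n grid distributed in square blocks over p processes might be estimated as

    T(p) \approx \frac{n^2}{p}\, t_{comp} + 4\left(s + r\,\frac{n}{\sqrt{p}}\right)

where t_{comp} is the per-point computation time, s the message startup (latency) cost, and r the per-word transfer cost. Comparing measured per-node and parallel times against such an estimate points to the parts of the code that under-perform.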
-
Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K.
Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna
Kunchithapadam, and Tia Newhall.
The Paradyn Parallel Performance Measurement Tools.
IEEE Computer 28(11), November 1995, pp. 37-46.
Special issue on performance evaluation tools for parallel and
distributed computer systems.
-
Paradyn is a tool for measuring the performance of large-scale parallel
programs. The goal is to provide detailed, flexible performance
information without incurring the space and time overhead typically
associated with trace-based tools. Paradyn achieves this goal by
dynamically instrumenting the application and automatically controlling
the instrumentation in search of performance problems.
Paradyn also provides decision support for the user by helping to decide
when and where to insert instrumentation, automatically locating performance
bottlenecks, and explaining performance bottlenecks
using descriptions and visualizations. Paradyn maps performance
data to multiple layers of abstraction, and the user can choose to
look at it in terms of high-level language constructs or low-level
machine structures. Paradyn configuration language (PCL) files
specify metrics, instrumentation, visualizations, and daemons
(defaults are provided, which the user may modify or replace).
The Paradyn Instrumentation Manager encapsulates architecture-specific
knowledge and translates abstract instrumentation specifications into
machine-level instrumentation. Instrumentation instructions are
transferred to the application's binary image by a variation of the
Unix ptrace interface. Performance data from external sources,
such as hardware-based counters, can also be integrated.
Paradyn provides basic visualizations for time histograms, bar charts,
and tables and provides a library and remote procedure call interface for
incorporating external data visualization packages (e.g., AVS,
ParaGraph or Pablo displays).
- Daniel Reed, Keith Shields, Will Scullin, Luis Tavera, and
Christopher Elford.
Virtual Reality and Parallel Systems Performance Analysis.
IEEE Computer 28(11), November 1995, 57--67.
- Jerry Yan, Sekhar Sarukkai, and Pankaj Mehra.
Performance Measurement, Visualization and Modeling of Parallel and
Distributed Programs using the AIMS toolkit.
Software: Practice and Experience 25(4), April 1995, 429-461.
- The authors maintain that performance tuning tools should not be
closely coupled to a specific architecture or particular system software,
but that the architecture-specific components of a toolkit should be
isolated and interfaces to them clearly defined to enhance portability.
AIMS is a toolkit for tuning and predicting performance of message-passing
programs on both tightly and loosely-coupled multiprocessors.
Instrumented source code is executed (with help of a machine-specific
monitor) to generate a tracefile which is run through an intrusion
compensation module before being input to various post-processing
tools. The AIMS Xinstrument source code instrumentor is written
using the Sage compiler frontend library and handles large scientific
applications with complex directory structures that may include a
combination of C and Fortran modules. The instrumentor uses
Sage flow analysis to support tracking of data structure movements.
The performance monitoring library is responsible for collecting
data, flushing traced events to a tracefile, and generating a data
structure lookup table used by the post-processing tools to present
performance information in terms of source-level data structures.
- G.A. Geist, James Kohl, and Philip Papadopoulos.
Visualization, Debugging, and Performance in PVM.
In Debugging and Performance Tuning for Parallel Computing Systems,
ed. Simmons, Hayes, Brown, and Reed,
IEEE Computer Society Press, 1996.
- XPVM is a graphical user interface for PVM that includes a control
console, a real-time performance monitor, a call level debugger, and
post-mortem analysis. A space-time view represents each PVM task by
a bar with different colors representing different states and represents
messages as arrows between bars. The utilization view is a graph of the
number of tasks versus time and shows the number of tasks computing,
communicating, and idle at each moment. The tracefile written by XPVM
is in SDDF format and thus may be interpreted by other performance
analysis tools such as Pablo. XPVM is written in Tcl/Tk, which makes
it easy to change, customize, and extend the PVM interface.
A PVM debugger interface has been developed so that debugger
developers can integrate their products into the PVM environment
without modifying the PVM source. A debugger can register with PVM
as the task starter and controller for a host, and PVM then defers
all task spawning requests to the debugger. Debuggers can also
attach to running PVM tasks.
-
Krishna Kunchithapadam and Barton P. Miller.
Integrating a Debugger and a Performance Tool for Steering.
in Debugging and Performance Tuning for Parallel Computing Systems,
ed. Simmons, Hayes, Brown, and Reed,
IEEE Computer Society Press, 1996.
- Steering involves the optimization and control of the performance of
a program using mechanisms external to the program. This paper presents
the design of a generic steering tool that uses mechanisms already
present in some performance measurement tools and in symbolic breakpoint
debuggers. The programmer uses the results of performance analysis
to decide on an optimization. The steering tool then invokes a symbolic
debugger to allow the programmer to modify the program to effect the
steering changes. Examples of steering include loop convergence control
in iterative numerical computations, adjusting load balance and degree
of parallelism, array redistribution, and changing algorithms
(e.g., by loading different dynamic libraries). The steering configuration
described in this paper uses the Paradyn tool, which uses dynamic
instrumentation together with an automated search mechanism to identify
performance bottlenecks, as the performance analysis tool.
The steering tool can be used for either interactive or automated steering.
-
Peter Buhr, Martin Karsten, and Jun Shih.
KDB: A Multi-threaded Debugger for Multi-threaded Applications.
Proc. SPDT'96: ACM SIGMETRICS Symposium on Parallel and Distributed Tools,
May 1996, Philadelphia, pp. 80--87.
- The purpose of the KDB project is to demonstrate, using the uC++ shared
memory thread library, that it is possible to build powerful and flexible
debugging support for user-level threads. Use of the synchronous UNIX
debugging primitives /proc and ptrace precludes independent and asynchronous
control of individual threads within a UNIX process and restricts a
distributed debugger to independent control of one thread per UNIX process.
KDB is a concurrent debugger that runs on symmetric shared-memory
multiprocessors and that supports the uC++ execution environment, which
shares all data and has multiple kernel and user-level threads.
Using KDB, the programmer can individually control and examine each
user-level thread. Alternatively, tasks can be grouped together and
operations issued on the group.
-
Matthias Schumann.
Automatic Performance Prediction to Support Cross Development of Parallel
Programs.
Proc. SPDT'96: ACM SIGMETRICS Symposium on Parallel and Distributed Tools,
May 1996, Philadelphia, pp. 88--97.
- This paper presents a performance prediction methodology for supporting
cross development (e.g., from
loosely coupled to tightly coupled multiprocessors)
of deterministic message-passing programs. In the abstract parallel
numerical machine (APNM) model, the parameters of the machine model for
each considered system are determined automatically by a machine analyzer
based on a micro benchmark. The workload parameters of the target
programs are determined automatically by a static program analyzer/instrumentor
tool together with a profile run or an execution-driven simulation. Memory access
behavior of a program is modeled by heuristics that attempt to determine
the target of a memory access by looking at the source code. A piecewise
linear model of communication latencies accounts for discontinuities
caused by packetization and for the effect of the number of participants
in broadcasts and global synchronizations. The model ignores network
topology and contention. The PE performance predictor tool assembles a
weighted task graph representing the program's execution behavior on the
target machine and performs a critical path analysis upon the resulting
graph. PE provides a machine independent characterization of program
execution by visualizing the execution count profile of abstract HLL
instructions, and a machine dependent prediction of execution time.
Optionally, PE can generate a trace file for use with the TATOO performance
analysis tool.
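The piecewise linear latency model mentioned above can be written, for a message of n bytes and a packet size P (a generic formulation introduced here for illustration, not necessarily the paper's exact model), as

    T_{msg}(n) = \alpha_k + \beta_k\, n \quad \text{for } (k-1)P < n \le kP,

so that a new intercept \alpha_k captures the discontinuity introduced each time an additional packet is required.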
Books
- Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack
Dongarra.
MPI: The Complete Reference Volume 1, 2nd Edition.
MIT Press, 1998.
- William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill
Nitzberg, William Saphir, and Marc Snir.
MPI: The Complete Reference Volume 2. MIT Press, 1998.
-
William Gropp, Ewing Lusk, and Anthony Skjellum.
Using MPI: Portable Parallel Programming with the Message-Passing
Interface.
MIT Press, 1994.
- Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Bob Manchek,
and Vaidy Sunderam.
PVM: Parallel Virtual Machine--A User's Guide and Tutorial for
Network Parallel Computing.
MIT Press, 1994.
Papers
- Oliver A. McBryan.
An overview of message passing environments.
Parallel Computing 20(4), April 1994, 417-444.
-
Andrew S. Grimshaw and Jon B. Weissman and W. Timothy Strayer.
Portable Run-Time Support for Dynamic Object-Oriented Parallel Processing.
ACM Trans. on Computer Systems 14(2), May 1996, 139-170.
-
Mentat is an object-oriented parallel processing system. The Mentat
compiler and run-time system work together to automatically manage
the communication and synchronization between objects. This paper
describes the Mentat run-time system, which marshalls member function
arguments, schedules objects on processors, and dynamically constructs and
executes large-grain data dependence graphs.
- Adam Beguelin, Jack Dongarra, Al Geist, and Vaidy Sunderam.
Recent Enhancements to PVM.
International Journal of Supercomputer Applications 9(2), Summer 1995.
-
This paper describes new features introduced in version 3 of PVM.
A new communication function pvm_psend() combines the initialize, pack,
and send steps for sending a message into a single call in order to
improve performance. The complementary routine pvm_precv() combines
unpack and blocking receive. Another new communication function in
PVM 3.3 is a timeout version of receive. Two data encoding options
are available in PVM 3.3 to boost communication performance:
1) the user can tell PVM to skip the XDR encoding step if the message
will only be sent to hosts with a compatible data format, and
2) the in-place option, which causes data to be copied directly from
user memory to the network rather than being packed into a buffer.
In addition to the already existing broadcast to a group of tasks
and barrier across a group of tasks, PVM 3.3 adds three new
collective communication routines: 1) pvm_reduce, with predefined
reduce functions for global max, min, sum, and product, and with
which user-defined reduce functions may be used, 2) pvm_gather,
and 3) pvm_scatter.
New interfaces in PVM 3.3 allow PVM tasks to take over functions
normally performed by the pvm daemons: starting hosts and tasks, and
making scheduling decisions. These interfaces have been used in
collaborative work with the Condor and Paradyn projects.
New shared memory optimizations implement
data movement with a shared buffer pool and lock primitives.
A new tracing feature can be used to generate events for PVM calls.
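A minimal C sketch of the PVM 3.3 calls mentioned above, namely pvm_psend/pvm_precv and the PvmDataRaw/PvmDataInPlace encoding options; error handling and task setup are omitted, and the tags and counts are arbitrary.

```c
/* Sketch of PVM 3.3 combined pack/send and receive/unpack calls, plus the
 * PvmDataRaw / PvmDataInPlace encoding options discussed above. */
#include <pvm3.h>

void exchange(int peer_tid, double *out, double *in, int n)
{
    int rtid, rtag, rcnt;

    /* pvm_psend packs and sends in a single call. */
    pvm_psend(peer_tid, 1 /* msgtag */, out, n, PVM_DOUBLE);

    /* pvm_precv blocks, receives, and unpacks directly into user memory. */
    pvm_precv(peer_tid, 2 /* msgtag */, in, n, PVM_DOUBLE, &rtid, &rtag, &rcnt);
}

void send_raw_inplace(int peer_tid, double *out, int n)
{
    /* PvmDataInPlace sends straight from user memory; PvmDataRaw would
     * instead pack without XDR encoding for hosts with compatible formats. */
    pvm_initsend(PvmDataInPlace);
    pvm_pkdouble(out, n, 1 /* stride */);
    pvm_send(peer_tid, 3 /* msgtag */);
}
```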
- Greg Burns, Raja Daoud, and James Vaigl.
LAM: An Open Cluster Environment for MPI.
- Trollius is a multicomputer operating environment that runs natively
on dedicated parallel machine nodes (In The Box, or ITB, nodes) and
as a Unix daemon on general-purpose (Out of The Box, or OTB) nodes.
LAM is a subset of Trollius that runs only on a network of UNIX machines
acting as OTB nodes. LAM has been used to implement PVM, MPI, and
a tuple-based tool (although some features of PVM were not implemented).
LAM has a powerful message buffering system that supports non-blocking
message send, overlapped communication, and debugging. Processes can
organize into groups, and group membership is dynamic. A full range of
collective communication functions are available. LAM supports a
loosely synchronous parallel I/O access method based on Cubix.
A separate paper by Nick Nevin (OSC-TR-1996-4) compares the performance
of LAM 6.0 and MPICH 1.0.12 on a workstation cluster for a suite
of six benchmark programs consisting of ping, ping-pong, barrier,
broadcast, gather, and alltoall. LAM 6.0 is faster for all benchmarks
for messages up to 8192 bytes in length, which is where LAM switches
over to a long message protocol (MPICH default is to change protocols
at 16384 bytes).
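A ping-pong microbenchmark of the kind used in such comparisons can be sketched with MPI. This is a generic illustration, not code from the OSC report; run it with two processes.

```c
/* Generic MPI ping-pong timing loop between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    enum { LEN = 8192, REPS = 1000 };
    char buf[LEN];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, LEN);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %g microseconds\n",
               1e6 * (t1 - t0) / REPS);

    MPI_Finalize();
    return 0;
}
```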
- Tom Loos and Randall Bramley.
MPI Performance Comparison on Distributed and Shared Memory Machines.
MPI Developers Conference, University of Notre Dame, July 1996.
Books
-
Michael Wolfe. High Performance Compilers for Parallel Computing.
Addison-Wesley, 1996. ISBN 0-8053-2730-4
- See
http://www.cse.ogi.edu/Sparse/book.html for more information.
- Hans Zima and Barbara Chapman.
Supercompilers for Parallel and Vector Computers.
ACM Press, 1991. ISBN 0-201-17560-6
- See
http://www.acm.org/catalog/books/homepage.html for more information.
- Utpal Banerjee.
- Loop Transformations for Restructuring Compilers. January 1993.
- Loop Parallelization. April 1994.
- Dependence Analysis. October 1996.
- See Kluwer Academic Publishers for more information.
Papers
-
I. Foster, D. Kohn, R. Krishnaiyer, and A. Choudhary.
Double Standards: Bringing Task Parallelism to HPF via the Message
Passing Interface.
Proc. Supercomputing '96.
- This paper describes the HPF/MPI library (an HPF binding for MPI),
which is a coordination
library for allowing expression of mixed task/data-parallel
computations and the coupling of separately compiled data-parallel
modules. The HPF/MPI library supports an execution model in
which disjoint process groups (corresponding to data-parallel tasks)
interact with each other by calling group-oriented communication functions.
This approach is in contrast to compiler-based approaches that attempt
to identify task-parallel structures automatically and to language-based
approaches that provide new language structures for specifying task
parallelism explicitly.
The HPF/MPI library includes techniques for determining data distribution
information and for communicating distributed data structures efficiently
from sender to receiver.
Results from two-dimensional FFT, convolution, and
multiblock programs demonstrate that the HPF/MPI library can
provide performance superior to that of pure HPF.
The HPF/MPI prototype uses the HPF compiler produced by the Portland
Group, Inc. and the MPICH implementation of MPI.
- David F. Bacon, Susan L. Graham, and Oliver J. Sharp.
Compiler Transformations for High-Performance Computing.
ACM Computing Surveys 26(4), December 1994, 345-420.
-
This survey describes transformations that optimize programs written
in imperative languages such as Fortran and C for high-performance
architectures, including superscalar, vector, and various classes of
multiprocessor machines. Optimizations for high-performance
machines maximize parallelism and memory locality with transformations
that rely on tracking the properties of arrays using loop dependence
analysis. Communication optimizations for distributed memory
machines include message vectorization and aggregation, collective
communication, overlapping communication and computation, and
elimination of redundant communication.
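Message vectorization, one of the communication optimizations surveyed, replaces per-element communication inside a loop with a single transfer of the whole array section. A simplified before/after sketch using MPI calls (illustrative only, not taken from the survey):

```c
/* Before/after sketch of message vectorization. */
#include <mpi.h>

/* Before: one message per loop iteration (element-wise communication). */
void send_naive(double *a, int n, int dest)
{
    for (int i = 0; i < n; i++)
        MPI_Send(&a[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* After: dependence analysis shows the whole section can be transferred
 * at once, so the loop collapses to a single aggregated message. */
void send_vectorized(double *a, int n, int dest)
{
    MPI_Send(a, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}
```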
-
B. Rau and J. A. Fisher.
Instruction-level parallel processing: History, overview, and perspective.
J. Supercomputing 7(1/2), May 1993, 9-50.
-
Banerjee et al. The Paradigm Compiler for Distributed Memory Multicomputers.
IEEE Computer 28(10), October 1995, 37-47.
-
Guy E. Blelloch.
Programming Parallel Algorithms.
Communications of the ACM 39(3), March 1996, 85-97.
-
S.P. Johnson, C.S. Ierotheou, and M. Cross.
Automatic parallel code generation for message passing on distributed
memory systems.
Parallel Computing 22(2), February 1996, 227-258.
-
This paper describes algorithms and a toolkit (CAPTools) for automating as much
as possible of the parallelization process for mesh based numerical codes,
with a focus on structured mesh computational mechanics software,
including for example CFD, heat transfer, structural analysis,
electromagnetics, and acoustics.
The authors argue that to satisfy the needs of application code authors,
the parallelization should commence from scalar code and should be mainly
automated. User interaction is considered essential
to produce a very accurate dependence analysis of the code.
The authors contrast their approach with that of High Performance Fortran
(HPF) and describe a number of factors that inhibit the success of
HPF compilers. A companion article on CAPTools (in the same issue
of Parallel Computing) gives performance results for a 2-D diffusion code,
the APPLU code from the NAS parallel benchmark suite, the TEAMKE1 2-D CFD code,
and a 3-D CFD code. The authors claim that use of CAPTools allows
the parallelization process to be completed in a fraction of the time
required for a manual effort while achieving parallel code as efficient
as could be achieved by hand. Other companion articles (also in the
same issue of Parallel Computing) describe the interprocedural
dependence analysis techniques used in CAPTools and the CAPTools user
interface.
-
Joy Reed, Kevin Parrott, and Tim Lanfear.
Portability, predictability and performance for parallel computing:
BSP in practice.
November 1995.
- This paper reports on practical experience using the Oxford BSP Library
to parallelize a large electromagnetic code, the British Aerospace
finite-difference time-domain code EMMA T:FD3D. The Oxford BSP Library
is an implementation of the Bulk Synchronous Parallel computational model
targeted at numerically intensive scientific (typically Fortran)
computing. BSP cost-modeling techniques can be used to predict and
optimize performance for single-source programs across different parallel
platforms. Predicted and observed performance results are given for
the BAe EMMA code on an 8-processor IBM SP1, a 4-processor SGI Power
Challenge L, and a 128-processor Cray T3D.
The authors claim that the BSP approach can achieve good application
performance for numerically-intensive codes across a complete range of
parallel architectures with only a single maintainable code.
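The cost modeling referred to here rests on the standard BSP cost formula, stated below in its usual textbook form rather than as quoted from the paper: a superstep in which each processor performs at most w local operations and sends or receives at most h words costs

    w + h\,g + l,

where g is the machine's per-word communication cost and l its barrier synchronization cost. Summing these costs over all supersteps of a program yields a performance prediction for any machine whose g and l have been measured.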
-
Cherri Pancake.
Is Parallelism for You?
IEEE Computational Science & Engineering 3(2), Summer 1996, 18--37.
-
This article gives a good introduction to parallel machine architectures,
parallel programming models, and problem "architectures", and how
the three match up. The article gives guidelines, or "rules of thumb",
for figuring out which problem category an application fits into,
and whether effort in parallelizing the application will be worth the cost.
The author discusses how to measure a serial program's potential
parallel content and calculate its theoretical speedup, including
consideration of the effect of problem size on theoretical speedup.
Factors that can cause observed speedup to differ from theoretical
speedup are discussed. To illustrate the rules of thumb, the article
includes an example of analyzing and parallelizing a volume renderer
application.
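The theoretical speedup calculation mentioned above is essentially Amdahl's law, shown here in its standard form (the article's own presentation may differ): if a fraction f of the serial execution time can be parallelized, then the speedup on p processors is bounded by

    S(p) = \frac{1}{(1 - f) + f/p},

which also makes clear why increasing the problem size, and with it f, raises the attainable speedup.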
Papers
-
Rosenblum, Chapin, Teodosiu, Devine, Lahiri, and Gupta.
Implementing Efficient Fault Containment for Multiprocessors:
Confining Faults in a Shared-Memory Multiprocessor Environment.
-
The performance advantages of future large-scale shared-memory
multiprocessors may be offset by reliability problems if they run
current operating systems. With current multiprocessor operating
systems, the entire machine must be rebooted when a nontrivial hardware
or software fault occurs. The effect is to terminate all applications
even though the initial fault may have affected only a few of the
applications. According to the authors, the key to improving the
reliability of shared-memory multiprocessors is to provide fault containment.
This article describes the authors' experience with implementing
efficient fault containment in a prototype operating system called Hive.
Hive is a version of Unix that is targeted to run on the Stanford
FLASH multiprocessor.
See also http://www-flash.stanford.edu/Hive/
-
Wood and Hill. Cost-Effective Parallel Computing.
IEEE Computer, February 1995.
-
The authors' thesis is that a parallel system provides better cost/performance
than uniprocessors whenever speedup exceeds costup, which is the parallel
system cost divided by the uniprocessor cost. When applications have large
memory requirements (e.g., 512 Mbytes),
the costup can be far less than linear, because
parallelizing a job on p processors rarely multiplies the memory requirements
by p. Thus when memory cost is a significant fraction of system cost,
parallel systems can be cost-effective at modest speedups.
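In symbols, writing C(p) for the cost of the p-processor system and C(1) for the uniprocessor cost (notation introduced here only for illustration), the criterion is

    costup(p) = \frac{C(p)}{C(1)}, \qquad \text{cost-effective whenever } speedup(p) > costup(p).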
-
Heinlein, Gharachorloo, Dresser, and Gupta.
Integration of Message Passing and Shared Memory in the Stanford
FLASH Multiprocessor.
Proceedings of the 6th International Conference on Architectural Support for
Programming Languages and Operating Systems, pages 38-50, San Jose, CA,
October 1994.
-
This paper presents hardware and software mechanisms in FLASH to support
various message passing protocols and to handle interaction of the
messaging protocols with virtual memory, protected multiprogramming,
and cache coherence.
The FLASH custom programmable node controller, called MAGIC (Memory
and General Interconnect Controller), can be programmed to implement
both cache coherence and message passing protocols.
Message passing protocols are supported on the controllers to
keep the overhead on the compute processors low.
Message passing latency and overhead are further reduced by providing
protected user-level access to the programmable controllers.
Data are transferred as a series of cache lines. The flexibility
of FLASH allows support for either local or global cache coherence
and the interaction of message data with cache coherence.
The MAGIC chip can also be used to support active message style operations
that require computation to be invoked at remote nodes.
Simulation studies indicate that the system can sustain message transfer
rates of several hundred megabytes per second.
-
Holt, Heinrich, Singh, Rothberg, and Hennessy.
Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford
University, January 1995.
-
Chandra, Gharachorloo, Soundararajan, and Gupta.
Performance Evaluation of Hybrid Hardware and Software Distributed
Shared Memory Protocols.
Proceedings of the 8th ACM International Conference on Supercomputing,
pp. 274-288, Manchester, England, July 1994.
- David Kotz.
Applications of Parallel I/O.
Dartmouth Technical Report PCS-TR96-297
- The bibliography for this technical report is also available in
BibTeX and
HTML,
with links to many of the cited papers.
- P. Brezany and A. Choudhary.
Techniques and Optimizations for Developing Irregular
Out-of-Core Applications for Distributed-Memory Systems.
Institute for Software Technology and Parallel Systems, University of Vienna,
TR 96-4. November 1996.
-
This paper presents techniques for implementing out-of-core irregular
problems. In particular, it presents the design of a runtime system that
efficiently supports out-of-core irregular problems, and it describes the
transformations required to reduce the I/O overheads for staging data as
well as for communication. Several optimizations are described for each
step of the parallelization process. The proposed techniques can be used by
a compiler that compiles a global name space program (e.g., written in HPF
and its extensions) or by users writing programs in node+message-passing
style. The authors first describe the runtime support and the
transformations for a restricted version of the problem, in which only part
of the data (the large data structures) is assumed to be out-of-core, and
then generalize the techniques to the case in which all the major datasets
of an application are out-of-core. The main objective of the proposed
techniques is to minimize I/O accesses in all steps while maintaining load
balance and minimal communication. Experimental results from a template CFD
code demonstrate the efficiency of the presented techniques.
ptlib_maintainers@nhse.org