This portion of the effort concentrated on capabilities that are accessed, not via command-level or higher-level environments, but through library calls from within HPC applications. The group addressed the current state of HPC libraries in general, lamenting the lack of consistency and interoperability, which makes it virtually impossible to develop a single source code application that will execute successfully, much less efficiently, across a full range of HPC platforms.
The discussions covered libraries supporting: message-passing, remote memory operations, threads, a variety of mathematical routines, parallel I/O, timers and other performance monitoring functions, and several functions associated with managing parallel processes. (The topic of parallel class libraries was also raised, but later dismissed because it was felt that the technology was not yet mature enough to formulate clear requirements.) All but the process management features are described in this chapter. Requirements for obtaining information on process status will be found under Debugging Tools, 5.1, and support for cleaning up orphaned processes is included in Operating System Services, 8.2.
For all library support, the group insisted that APIs (application programming interfaces) must be usable from Fortran77, Fortran90, C, and mixed-language programs.
It was noted that while standardization of APIs is probably the most feasible approach to arriving at consistency in the next few years, many efforts in this direction have misfired or dragged on indefinitely. (MPI is a notable exception to this pattern.) The group acknowledged that a lack of clear user-community support for standards efforts has had a negative impact on library standardization, increasing the difficulty of achieving consensus, particularly in cases where vendors have already invested significant effort to develop proprietary libraries. For some requirements areas, therefore, the group specified the availability of a "published API" - that is, invocation syntax and semantics spelled out in the documentation available to users of a system - in lieu of a standard.
While most programmers would prefer to be able to implement parallel applications at a higher level of abstraction, the group agreed that message-passing was the only viable way to achieve reasonable portability across current HPC systems. After a prolonged discussion, it was also decided that at least for the next few years, PVM as well as MPI should be supported.
While a variety of reasons were cited, the most important was the inability of MPI to support applications across heterogeneous machines - even when those machines are manufactured by the same vendor. Although such applications are becoming commonplace, and will certainly grow in popularity over the next few years, the MPI forum de-emphasized interoperability in favor of the performance gains to be realized by constraining the execution environment. Application developers who need to attain good performance are finding that MPI programs need to be tuned for each target platform [14]. The wide latitude given to MPI implementors suggests that interoperability will not be possible within the near term. (See "New Directions" for recent efforts to improve this situation.) PVM, on the other hand, was designed with heterogeneity in mind, although some vendor implementations impose restrictions on interactions with other platforms. It also includes features that can be helpful for fault tolerance in heterogeneous applications.
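To illustrate the heterogeneity point, the following is a minimal sketch (in C, not drawn from the Guidelines) of a PVM 3 exchange. The PvmDataDefault encoding requests XDR conversion of packed data, which is what allows the sending and receiving tasks to run on unlike architectures; the message tag and the helper names send_to/recv_from are purely illustrative.

    #include <pvm3.h>

    /* Sketch: exchanging a double between PVM tasks using the default
     * (XDR) encoding, which lets tasks on unlike architectures interoperate. */
    void send_to(int dest_tid, double x)
    {
        pvm_initsend(PvmDataDefault);   /* XDR-encode for heterogeneity */
        pvm_pkdouble(&x, 1, 1);
        pvm_send(dest_tid, 1);          /* message tag 1, chosen arbitrarily */
    }

    double recv_from(int src_tid)
    {
        double x;
        pvm_recv(src_tid, 1);
        pvm_upkdouble(&x, 1, 1);
        return x;
    }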
A key criticism leveled against existing implementations of message-passing libraries was that many vendors fail to keep up with the current version of the standard; one case was cited where a comparison across half a dozen vendors revealed that only one supported the complete, up-to-date standard. Moreover, incompatibilities or inconsistencies must often be identified through trial and error, since the documentation typically implies that the implementation is complete. In the interests of application portability, the group insisted that all library routines in the standard be supported; if a particular operation is unnecessary or redundant on a given platform, it can be implemented as a dummy.
Vendor representatives raised some of the issues that make compliance difficult, including the relatively rapid rate at which message-passing libraries have evolved. Expecting each vendor to maintain up-to-date implementations of two very different libraries (MPI and PVM) was considered unrealistic. A compromise was reached: the group specified that MPI support be kept current, while PVM could be frozen at version 3.3.7.
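For comparison with the PVM sketch above, here is the corresponding point-to-point exchange in MPI's C binding; it is a minimal illustration of the message-passing style at issue, not an excerpt from the Guidelines.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal MPI point-to-point exchange: rank 0 sends one integer to rank 1. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }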
Remote-memory operations - also referred to as one-sided messages - are of increasing interest to application programmers, particularly on clustered and SMP (symmetric multiprocessing) systems, since they offer a less intrusive mechanism for sharing data than do traditional message-passing systems. While a number of such operations have been proposed and implemented, the group was concerned only with support for the three most essential functions: remote get, remote put, and atomic increment-and-return-previous-value.
Although the MPI Forum had organized a working group to define such operations, there was no standard at the time the task force met (MPI-2 was not released until summer of 1997). Consequently, the group elected to recommend that vendors support basic one-sided functionality as soon as possible, even if the operations required platform-specific syntax.
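For concreteness, the sketch below shows remote put and get in the one-sided model that MPI-2 later standardized; it illustrates the functionality, not the platform-specific interfaces the group had in mind, and the atomic increment-and-return-previous-value operation was not standardized until MPI-3 (MPI_Fetch_and_op).

    #include <mpi.h>

    /* Sketch of remote put/get in the (later) MPI-2 one-sided model.
     * Each process exposes one long in a window; rank 0 writes into
     * rank 1's memory and reads it back.  Requires at least two ranks. */
    void one_sided_demo(void)
    {
        long local = 0, tmp = 123, readback = 0;
        int rank;
        MPI_Win win;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Win_create(&local, sizeof(long), sizeof(long),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(&tmp, 1, MPI_LONG, 1, 0, 1, MPI_LONG, win);      /* remote put */
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Get(&readback, 1, MPI_LONG, 1, 0, 1, MPI_LONG, win); /* remote get */
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
    }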
Thread operations are at a somewhat different state of evolution. While there is a formal standard - POSIX 1003.1c, developed by the 1003.4a working group - pthreads are defined for a shared address space only. (The MPI-2 standard, released in mid-1997, partially addresses this problem.) Further, pthreads have not yet been implemented on many parallel or clustered machines. The group decided that it was important to require thread support, for several reasons. First, thread technology is well understood and will soon come to dominate low-end computing platforms; the availability of thread functionality may make it possible to leverage developments in that more lucrative market. Second, threads might provide an opportunity to exploit the communication processors packaged on some parallel nodes. Finally, this is a clear example of a standard that was successfully established, and then effectively ignored by the HPC community, largely because it did not address the broader issues (i.e., when address space is not shared). The fact that threads address and solve a specific, small problem does not make them irrelevant; rather, it is precisely the reason why standard thread operations should be available on all HPC platforms.
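As a point of reference, a minimal pthreads sketch in C follows (spawning and joining one thread within a single address space); it is illustrative only.

    #include <pthread.h>
    #include <stdio.h>

    /* Minimal pthreads sketch: spawn one worker thread in the shared
     * address space and wait for it to finish. */
    static void *worker(void *arg)
    {
        printf("worker running with argument %ld\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        if (pthread_create(&tid, NULL, worker, (void *)1L) != 0)
            return 1;
        pthread_join(tid, NULL);
        return 0;
    }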
The importance of math libraries to application programmers is not, perhaps, as obvious as it should be. Users participating in the discussions indicated that a considerable portion of the time and effort in developing HPC applications is spent deriving and testing their own versions of quite standard mathematical functions, simply because they are not available on the target platform. A particularly clear example is FFTs. Although the algorithms are well known, there is no generally available library support for FFTs. A significant majority of the application programmers on the task force reported that they, and their colleagues, are forced to implement roll-your-own versions, often changing or rewriting them for each new platform they use.
The foreknowledge that certain types of functionality would be available on all HPC machines would free the programmer to concentrate on the specific domain of the application, significantly reducing overall development time. Concentrating the implementation of math routines in the hands of fewer people would probably improve reliability and accuracy as well, since library implementors are more likely to be experts in algorithmic developments than are HPC programmers whose primary interest is CFD, molecular chemistry, etc. Several users cited the example of TMC's library, CMSSL, which was well defined, very complete in a number of areas, and easy to incorporate into applications (because of both the library's organization and its comprehensive documentation).
The group spent considerable time discussing which particular functions are needed by so many applications that they should be available on all HPC platforms. Careful distinction was made between functions that should execute on a single PE (processing element) and those that should be capable of executing across parallel PEs. The group noted that the absence of a library from the guidelines does not mean it is unimportant, but rather that its use is relatively specialized.
Note that the requirements specify functionality alone - there is no implied requirement for a standard API. While that would be desirable in the sense that it would not be necessary to re-write applications in migrating to a new platform, issues such as data layout are still highly platform-dependent, making it difficult to arrive at generally applicable standards. However, the group strongly recommended that vendors provide for arrays with dimension padding (i.e., where the array is sized larger than strictly necessary for its contents), as well as options for different data layouts or scrambling schemes.
FFTs. For serial execution on a single PE, support should be provided for one-, two-, and three-dimensional FFTs for both radix-2 and mixed-radix arrays, covering complex-to-complex, real-to-complex, and complex-to-real formats. Similar support is needed for parallel execution across PEs. No specification was made by the group as to the data layout or scrambling, just that the implementation support at least one reasonable scheme. That is, users recognize that they may need to adjust their data to fit the format required by each platform.
Linear Algebra. Here, API standardization has already been achieved and should be supported. BLAS is already standard on serial platforms, and Levels 1, 2, and 3 should be provided for execution on individual PEs. LAPACK should be similarly supported. ScaLAPACK, the parallel version, should be supported for execution across multiple PEs.
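As an illustration (not a requirement from the Guidelines), the sketch below makes a Level-3 BLAS call through the C interface, assuming a CBLAS binding is available; systems that provide only the Fortran bindings would call dgemm_ instead.

    #include <cblas.h>

    /* Level-3 BLAS example: C = alpha*A*B + beta*C for small row-major
     * 2x2 matrices, via the CBLAS interface. */
    void small_gemm(void)
    {
        double A[4] = {1, 2, 3, 4};
        double B[4] = {5, 6, 7, 8};
        double C[4] = {0, 0, 0, 0};

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,         /* M, N, K */
                    1.0, A, 2,       /* alpha, A, lda */
                    B, 2,            /* B, ldb */
                    0.0, C, 2);      /* beta, C, ldc */
    }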
Sparse Matrix Operations. Basic support is needed for both serial and parallel operations, applicable to general-sparse and sparse-block representations. Five permutations of sparse/dense vector and matrix combinations, plus scatter and gather operations, were identified as essential. In addition, three basic mesh operations (generate_dual, partition_mesh, and reorder_pointers) are needed, again in both parallel and serial versions, because of their usefulness in finite-element methods.
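By way of illustration, here is a serial sketch of one such kernel - a sparse matrix (in compressed sparse row form) times a dense vector; the representation and names are generic, not drawn from the Guidelines.

    /* Serial sketch of y = A*x with A stored in compressed sparse row (CSR)
     * form.  Illustrative only; a library version would also supply the
     * parallel, distributed variants discussed above. */
    void csr_matvec(int nrows,
                    const int *row_ptr,   /* length nrows+1 */
                    const int *col_idx,   /* length nnz */
                    const double *val,    /* length nnz */
                    const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }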
Eigensolvers. The basic routines for dense matrices defined by LAPACK should be available for both serial and parallel use. In addition, similar functionality should be available for sparse matrices.
Random Number Generation. Users were vocal about the need for a parallel random number generator, such that using the same seed on multiple PEs or across HPC platforms would result in the same, reproducible sequence of random numbers. Also, there must be a mathematically sound method of choosing seeds, in order to be able to produce different sequences of random numbers, if desired, on different PEs. The group identified a lagged Fibonacci generator with Mascagni's seed selection algorithm as the best method for satisfying these constraints.
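For illustration only, a toy additive lagged Fibonacci generator with lags (17, 5) is sketched below; it shows just the recurrence and a crude seeding, and does not implement Mascagni's seed-selection algorithm, which is what actually provides the independent, reproducible per-PE streams.

    /* Toy additive lagged Fibonacci generator: x[n] = x[n-17] + x[n-5] (mod 2^32).
     * Illustrative only; proper parallel seeding is not shown. */
    #define LAG_P 17
    #define LAG_Q 5

    static unsigned int state[LAG_P];
    static int pos = 0;

    void lfg_seed(unsigned int seed)
    {
        /* Crude fill of the lag table from a linear congruential generator. */
        for (int i = 0; i < LAG_P; i++) {
            seed = seed * 1664525u + 1013904223u;
            state[i] = seed;
        }
        state[0] |= 1u;          /* at least one odd entry for full period */
        pos = 0;
    }

    unsigned int lfg_next(void)
    {
        int j = pos - LAG_Q;
        if (j < 0)
            j += LAG_P;
        state[pos] += state[j];  /* wraps mod 2^32 */
        unsigned int r = state[pos];
        pos = (pos + 1) % LAG_P;
        return r;
    }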
Array Transposition. Another important need is for fast, reliable methods of redistributing arrays across PEs, corresponding to all permutations of the array's indices. Further, the user should be able to choose among multiple decompositions by specifying which indices are distributed. Support should include the ScaLAPACK distribution and blocked distributions where up to three indices are distributed. There must also be facilities for converting among those data decompositions.
Gather/Scatter Operations. As more applications use irregular data layouts, efficient gather/scatter operations become more essential. Such algorithms must be scalable, although the best solution may be to provide a number of implementations to cover the diversity of application size and layout needs. If possible, blocked or grouped gather/scatter operations, using collective communications within a subset of nodes, should also be provided.
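For illustration, serial sketches of the two basic kernels follow; a scalable library version would additionally move the selected elements between PEs, ideally using collective communication within a subset of nodes. The names are illustrative, not from the Guidelines.

    /* Gather elements of a large array selected by an index list into a
     * contiguous buffer, and scatter them back out. */
    void gather(const double *src, const int *index, int n, double *buf)
    {
        for (int i = 0; i < n; i++)
            buf[i] = src[index[i]];
    }

    void scatter(const double *buf, const int *index, int n, double *dst)
    {
        for (int i = 0; i < n; i++)
            dst[index[i]] = buf[i];
    }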
Considering that increased performance is the primary motivation for moving to HPC platforms, it is surprising that there is so little support available for obtaining accurate information on the performance of HPC applications. In particular, the users in the group were vocal about the need for reliable and accurate timing mechanisms. UNIX's time-of-day utility is the only timing function generally available across platforms, but since it involves a system call, it is too unreliable and intrusive for HPC applications.
The group identified support for interval timing as the primary need for applications development work. While a globally synchronized timer represents the best solution, it was recognized that this is a costly hardware solution that may not be appropriate for all systems. It was agreed, however, that at least single-process/thread timing should be available in all cases. Specifically, such support should provide the best accuracy possible on the platform, with the least possible intrusion, and should track not just wallclock time, but also user CPU time and system CPU time.
Vendor representatives pointed out that the concept of system CPU time was far from clear, and varied tremendously from one platform to another. For example, it might or might not include time spent waiting for I/O, servicing page faults, initializing the communication system, etc. Users responded that it was not necessary to provide times that were comparable from one system to another; rather, the point of such timers was to establish whether a particular code modification resulted in performance changes on a particular system. As long as the accounting was reasonably consistent from run to run, it could be quite platform-specific. Since this requirement was the target of a recent Parallel Tools Consortium project, the group identified that consortium's API for portable timing routines as the appropriate standard.
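The sketch below conveys the flavor of the interface intended, using only standard UNIX calls (gettimeofday and getrusage); it is not the Parallel Tools Consortium API, and a vendor implementation would substitute lower-overhead, higher-resolution platform timers behind a similar interface.

    #include <sys/time.h>
    #include <sys/resource.h>

    /* Report wallclock, user CPU, and system CPU seconds; call once at the
     * start and once at the end of a region and subtract to get interval times. */
    void read_timers(double *wall, double *user, double *sys)
    {
        struct timeval tv;
        struct rusage ru;

        gettimeofday(&tv, NULL);
        getrusage(RUSAGE_SELF, &ru);

        *wall = tv.tv_sec + tv.tv_usec * 1.0e-6;
        *user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1.0e-6;
        *sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec * 1.0e-6;
    }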
In a related discussion, basic arithmetic operation counts were identified as important performance-related information that should be obtainable from within an executing application. Since the purpose is to validate performance tuning efforts on fairly coarse regions of code (typically at the level of large loops or a number of iterations of smaller loops), it is not necessary that such counts be completely accurate, nor that the programmer be able to account for or compensate for compiler optimizations. The users in the group were clear that what they needed was access to the information that could be tracked by the hardware, rather than a new type of information requiring sophisticated implementation strategies.
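As a later illustration of the kind of access intended, the sketch below uses PAPI, a portable hardware-counter library that post-dates the task force and is not part of the Guidelines; PAPI_FP_OPS reports floating-point operations as seen by the hardware, with no attempt to compensate for compiler optimizations.

    #include <papi.h>

    /* Coarse operation counting around a code region using PAPI's
     * high-level counter interface. */
    long long flops_in_region(void)
    {
        int events[1] = { PAPI_FP_OPS };
        long long count = 0;

        if (PAPI_start_counters(events, 1) != PAPI_OK)
            return -1;

        /* ... the loop nest being tuned goes here ... */

        if (PAPI_stop_counters(&count, 1) != PAPI_OK)
            return -1;
        return count;
    }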
A considerable amount of time was spent discussing parallel I/O, which users view as the most pressing need for library support. In spite of at least three organized efforts in this direction - the MPI I/O working group, the Scalable I/O Initiative, and IOPADS - no consensus had been reached at the time the task force met. (MPI-2, released in mid-1997, provides such functionality.)
From the perspective of users, the group identified three distinct types of parallel I/O. It noted that current standards efforts were directed at the kind of comprehensive, low-level support required by compiler and library developers. A full implementation would provide a range of powerful and flexible operators, providing control over virtually every aspect of file I/O. However, the users in the group agreed that this is too low a level for most application programmers.
A higher level of parallel I/O is needed as well, whereby it is possible for multiple processes to read from or write to a logically shared file. (The sharing is "logical" in the sense that users don't care if the processes are reading from a single copy of the file, or sharing an index that is applied to multiple instances of the file.) This should be possible without adding more than a handful of new operations. Intel's parallel file system, for example, makes it possible to perform parallel I/O with no changes to the programming language's basic input and output operations. Instead, the user simply adds a new argument, controlling I/O mode, to the file open statement.
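As an illustration of this "logically shared file" style, the sketch below uses the MPI-2 I/O interface that appeared after the task force met: each process writes its own block of a single file at a rank-dependent offset. The file name and offsets are illustrative.

    #include <mpi.h>

    /* Every process writes nlocal doubles into one shared file at an offset
     * determined by its rank. */
    void write_shared(const double *local, int nlocal)
    {
        MPI_File fh;
        int rank;
        MPI_Offset offset;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        offset = (MPI_Offset)rank * nlocal * sizeof(double);

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, offset, local, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }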
A third type of I/O relates specifically to user-mediated checkpointing: the ability to dump large data structures to disk, and later restore them, using extremely efficient - but quite restrictive - operations. Such facilities might be extremely crude, in the sense of providing no real formatting capabilities. It would be unreasonable, however, to require that data be restored onto precisely the same PEs from which they were dumped, as some vendors currently do.
Users do not expect all three of these levels to be available immediately. However, they were insistent that they must have at least rudimentary parallel I/O facilities now, even if more flexible and better performing solutions are still a few years off. In particular, they want simple, Fortran- and C-compatible libraries to perform four very basic types of I/O: