NHSE Review™ 1997 Volume, First Issue

Establishing Standards for HPC System Software and Tools



Chapter 3 -- How HPC Differs from Other Areas of Standardization

HPC, like other areas of computing, shares some of the most common problems associated with standardization efforts. In many areas of computing it is generally felt that standardization has not been as effective as it should be, and this is a problem for vendors as well as users.

Another shared problem is that it is unclear who should be responsible for defining and enforcing standards. A number of authors have explored whether government, non-profit, or for-profit standards groups should take the lead in new standards efforts [21, 16, 23, 6].

There is no clear evidence that any one approach is the best solution. In HPC, this may be something of a moot point, since to date, HPC standards projects have always involved multiple HPC vendors, some software researchers from academic and federal organizations, and some users (with the exception of Ptools projects, which mandate a high level of user involvement).

3.1 The Nature of the HPC Community Constrains Standardization

The HPC community itself imposes a number of serious constraints on standardization efforts. Consider the position of HPC vendors. At any one time, there are fewer than a dozen companies producing HPC machines. Risks are extremely high, resulting in a distressingly short mean-time-to-bankruptcy. These companies are struggling to exist, not contemplating how standards might improve the future directions of HPC. The turnover in technical personnel is quite rapid, so it is hard to find the longevity and personal dedication that characterize standardization participants. Further, there is a widespread notion that "software doesn't sell HPC machines." Vendors do not perceive a real economic advantage to participating in standards efforts or to implementing features that would make their machines interoperate with others. Most importantly, HPC represents only a tiny fraction of the computing market. Despite the evidence that HPC's pioneering technology goes on to permeate the broader market, HPC is not perceived as important to the success of other product lines.

The nature of the HPC user community is constraining, too. Application developers tend to be widely dispersed, not just geographically but also across disciplinary and agency boundaries. They apply very different types of computational techniques to very different types of problems. Some applications are destined to become production-level "work-horses" that will be re-used for years or even decades, while others are one-off research codes that address a single problem and may be used only a few times. Frequently, the only common denominator is the fact that HPC applications are computationally intensive. The lack of homogeneity applies to users' levels of HPC expertise as well. The impact of this, while not measurable in any traditional sense, is significant. Since few HPC programmers have formal training in computer science, it is their personal experiences that often shape their perceptions of software requirements and their expectations of standards efforts.

The HPC target machines pose additional constraints on standardization. The architectures at any given time represent a spectrum of approaches, each with very different performance profiles and constraints on software and tool design. Rather than addressing a single class of target, an HPC standard must address a whole collection of them. Moreover, HPC architectures are in a constant state of flux. It is very difficult to project a likely path for evolution beyond the next two to four years. With so little stability, standards writers are forced to provide a more general (or more fragmented) solution. This, in turn, makes it difficult for standards implementers to achieve good performance. And failure to meet performance expectations decreases user demand for the standard.

3.2 The Window-of-Opportunity Problem

Another constraint is that the window of opportunity for HPC standards is very short. One aspect of this problem is the need for standards committees to arrive quickly at consensus. Consider the example of the Parallel Computing Forum (PCF), which later became ANSI subcommittee X3H5. It began with a seemingly simple goal: to standardize the many flavors of shared-memory parallel Fortran that had evolved on different vector multiprocessors. Despite the fact that only a handful of language features were needed to meet that goal, the effort foundered.

The primary reason was that the window of opportunity for introducing such a standard was extremely short - perhaps two to three years. Meetings occurred only three or four times a year, however, with little committee activity in between. The result was that the committee's work was soon outpaced by other developments in the HPC industry. As companies went in and out of business, committee attrition soared. By the time agreement had been reached, all the participating companies had implemented their own (conflicting) versions of what they hoped the standard would be, and many had already shifted their attention to distributed-memory systems, where the constructs did not apply anyway. The lengthy process of formal review and comment imposed by ANSI was the final blow to an already moribund effort.

There is also a very small window of opportunity for implementing the standard in the form of full-fledged products. Consider the experiences of the High Performance Fortran (HPF) Forum. The HPF effort began in January of 1992, addressing the need for data-parallel extensions to Fortran. Despite very frequent meetings and a relatively large core of involved persons, it took almost two years to arrive at an initial version of the standard. The serious obstacle, however, proved to be implementation. It was important that array operations be available at the level of Fortran statements (e.g., whole-array assignments such as A = B + C). Since these had been defined by the Fortran90 standard, the HPF group decided to require Fortran90 as the base language for HPF [5]. While this made sense from the perspective of supporting other standards efforts, it severely hampered the implementation of HPF - none of the companies involved in the HPF group yet had a Fortran90 implementation.

One company, Applied Parallel Research, released a product soon after the HPF standard was released, but it implemented only a subset of the functionality. Several other products have reached the market since that time, but they do not implement the full standard either. Meanwhile, response from the user community has been lukewarm. Some have cited the fact that "after all the hype about HPF, there still aren't any compilers that can handle the features I'm interested in" [14]. Others say they are disappointed in what they have heard about HPF performance and do not think it is worth the effort to port their codes to the new language. It is not really surprising that a full, well-performing implementation of HPF has proven difficult - it was an ambitious project, and the lack of existing Fortran90 compilers was a real bottleneck. But the real window of opportunity had already passed, so it will be very difficult to woo users to HPF.

3.3 Bringing HPC Standards to Success

The Message Passing Interface (MPI) Forum provides an interesting contrast to the experiences of PCF and HPF. Beginning in November of 1993, the group worked to standardize existing practice in message-passing libraries. The first specification was released a year later [10].

MPI has been very successful, if the success of a standard is measured in terms of full implementations and the number of users who have adopted it. The contrasts between MPI and HPF are instructive:

  1. A reference implementation closely shadowed the evolving MPI standard, allowing the committee to verify feasibility and, more importantly, allowing some fairly hardy users to begin porting their applications quite early in the process.
  2. Because MPI drew heavily upon the semantics of existing message-passing libraries, there was a sound foundation on which vendors could base their implementations.
  3. Because MPI is essentially the union of all previous message-passing practices, it proved relatively straightforward for compiler, library, and application developers to convert from other libraries to MPI.
  4. Because MPI is a low-level programming interface (library calls) built on existing technology, it was much easier to achieve respectable performance (a minimal example in C appears after this list).
  5. Largely as a consequence of the other four factors, MPI began appearing as a requirement in HPC procurements very soon after the standard was released. This provided additional motivation for vendors to achieve timely, fully supported implementations.
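
To make the fourth point concrete, here is a minimal sketch of a message-passing program written against the MPI-1 C interface. The calls shown (MPI_Init, MPI_Comm_rank, MPI_Send, MPI_Recv, MPI_Finalize) are part of the standard; the payload value and message tag are arbitrary choices for illustration.

    /* Minimal MPI-1 example: process 0 sends one integer to
       process 1.  Compile with an MPI-aware C compiler and run
       with at least two processes (e.g., mpirun -np 2 a.out). */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 42;               /* arbitrary payload */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Because the interface is a plain library of calls like these, a vendor could layer it directly on top of an existing native message-passing system, which is why respectable performance came comparatively easily.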

The Parallel Tools Consortium (Ptools) is a different type of standards organization. Rather than defining a general standard such as a language or library, it has focused on very small "components" that can be integrated into larger systems. The idea was to arrive at standards incrementally, so that vendor implementation effort - and indirectly, the time required for a standard to arrive at users' desktops - would be reduced. Consider the example of Ptools' Message Queue Manager (MQM) project, which was started in April 1994 and released as a standard approximately 18 months later. It specifies a tool interface whereby users can capture snapshots of the message-passing system during program execution and examine them to determine whether logical or performance problems have occurred. Despite the concentrated focus of the project and the participation of several vendor organizations, however, implementations have been slow to appear (two are currently available, with another two expected by the end of 1997).
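
The MQM interface itself is not reproduced here. Purely as an illustration of what a snapshot-style tool interface can look like, the C sketch below shows a hypothetical call that returns descriptors for the messages currently pending in the message-passing layer; every name in it is invented for this example, not taken from the MQM specification.

    /* Hypothetical sketch only -- not the actual MQM interface.
       A tool (or the user's own code) asks the message-passing
       layer for a snapshot of its queues, then inspects each
       pending message for signs of logical or performance bugs. */
    #include <stdio.h>

    typedef struct {
        int source, dest;     /* sending and receiving processes */
        int tag;              /* message tag */
        int bytes;            /* payload size */
    } msg_entry;              /* invented descriptor type */

    /* Invented stub: a real implementation would query the
       runtime system; this placeholder reports an empty queue. */
    static int mq_snapshot(msg_entry *buf, int max)
    {
        (void)buf; (void)max;
        return 0;             /* number of pending messages */
    }

    int main(void)
    {
        msg_entry entries[64];
        int i, n = mq_snapshot(entries, 64);
        for (i = 0; i < n; i++)
            printf("msg %d -> %d, tag %d, %d bytes\n",
                   entries[i].source, entries[i].dest,
                   entries[i].tag, entries[i].bytes);
        return 0;
    }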

A more instructive example might be Ptools' Portable Timing Routines, which defined a standard API for accessing the least intrusive, highest resolution timer on each HPC platform. In this case, vendors contributed the low-level instructions for accessing the timers, while other members of the working group added wrappers implementing the API. The resulting platform-specific implementations were mounted on shared Web pages for downloading by users. Within a year of the original proposal to develop the timers, several implementations were already available for use. This model may be the most viable one for standards that are small and relatively easy to define, since it bypasses the lengthy process of developing and distributing a fully-supported vendor product. A similar project has recently been proposed for improving the interfaces of parallel debuggers (the High-Performance Debugging Forum; see New Efforts in HPC Standards).
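
The wrapper model is easy to picture. The C sketch below is a hypothetical illustration, not the actual PTR specification: the function name ptr_seconds() is invented, and the POSIX clock_gettime() call stands in for the vendor-contributed low-level timer instructions that would occupy the platform-specific part on a real HPC system.

    /* Hypothetical sketch of the PTR wrapper model: a common API
       whose body is swapped out per platform.  Here a POSIX call
       plays the role of the vendor's low-level timer access. */
    #include <time.h>

    /* ptr_seconds() is an invented name, not the PTR API itself. */
    double ptr_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);  /* platform-specific part */
        return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
    }

Users would then time a code region by subtracting two readings: t0 = ptr_seconds(); ...; elapsed = ptr_seconds() - t0. Because only the body of the wrapper changes from platform to platform, each vendor's contribution stays small, which is what made the quick turnaround described above possible.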

The problem is that arriving at a definition for a standard is just the first step. Both the vendor and user communities must "buy in" to the standard in order for it to succeed. It must be implemented and deployed across key vendor platforms so that other vendors will be pressured to conform, and so that the user community will believe it is stable enough to warrant their own investment of time. The implementations must be robust and they must be good performers - bad news travels fast in the HPC community. All of this must happen in the face of declining HPC resources and shifting priorities. And to be effective, all the pieces must fall into place within a short period of time.

While standardization obviously can work for HPC, it is a risky undertaking. The most critical bottlenecks are delays in arriving at a standard definition and delays in the release of standard-compliant products. If either lasts too long, all previously invested effort will be wasted. This raises the question of what inducements might encourage vendors and other participants to concentrate their efforts and speed up the standardization process. One recurring suggestion is that requirements, as well as software, be somehow "standardized". Not only would this ensure that all parties understand what is needed in HPC software, it would also make it possible to throw economic clout behind the requirements by including them in procurements of HPC systems.

3.4 A National Task Force to Standardize Requirements for HPC Software

The idea of standardizing HPC software requirements has been addressed by several workshops over the past few years, including the Pasadena Workshop on System Software and Tools for Parallel Computing Systems (Pasadena, 1992) [11], HPCC Grand Challenges Workshop (Pittsburgh, 1993), ACM/ONR Workshop on Parallel and Distributed Debugging (Santa Clara, 1991) [12], and two ARPA/NSF Workshops on Parallel Tools (Keystone, 1993 and Cape Cod, 1994)[22]. Most of these groups proposed that it was up to the user community to set standards for software, and that the best way to exert pressure on HPC vendors to meet requirements for robust and consistent software was for the user community to agree on how software should be specified in writing HPC procurements.

At the Second Pasadena Workshop, in January of 1995, a formal recommendation was made to the HPCC community that a national task force be created to address this problem. Specifically, it was charged with establishing the basic requirements for a "standard" software infrastructure that would support the development of parallel applications. In making the recommendation, the group underscored several elements it considered essential to making such a standard feasible.

Three of the Workshop's participants - Bruce Blaylock (NASA Ames Research Center), Robert Ferraro (NASA Jet Propulsion Laboratory and CalTech), and Cherri Pancake (Parallel Tools Consortium and Oregon State University) - joined forces to lead that effort. The Task Force was endorsed by the Parallel Tools Consortium and the National Coordinating Office for HPCC. It included over sixty representatives from major user sites, as well as commercial software vendors.

The results were released in conjunction with Supercomputing '95, in November of 1995. They were reviewed by the HPCC agencies in early 1996 and formally adopted in May of 1996. The document specifies two levels of requirements.

The procedures followed by the Task Force have been documented elsewhere. In general, there was quick consensus on what was needed from system software and tools; the problem came in naming a software product or specification that could serve as a standard for citation in procurements.

According to the guidelines established at the outset, a particular interface could be required only if an appropriate specification already existed in the form of a citable product, reference implementation, or formal definition. This was in the interests of practicality, since a vendor could not be expected to adhere to standards that could not be described or cited. Moreover, the requirements could not refer to capabilities or features that required new technology, since the goal was to begin applying the standard in procurements as soon as the task force completed its work. (In fact, DOE's Accelerated Strategic Computing Initiative (ASCI) used the draft version in preparing a procurement in autumn of 1995.)

The six chapters that follow describe the Task Force deliberations in each category of system software and tools discussed. In general, the group agreed that while existing products - particularly software tools - do meet some user needs, most also add superfluous functions, over-complicate even the most basic operations, and are much too platform-specific. Therefore, very few products or definitions were named in the final recommendations; those that were came from a handful of organizations (GNU, MPI Forum, Open Software Foundation, Parallel Tools Consortium, POSIX). Other elements had to be specified in terms of the functionality required or desired, with the understanding that vendors would probably implement them in distinctive ways. The rationale for these decisions is outlined in the text of each chapter. Hyperlinks provide direct reference to the wording that was adopted in the final document.





Copyright © 1997 Cherri M. Pancake