NHSE Review™ 1997 Volume, First Issue

Establishing Standards for HPC System Software and Tools



Chapter 8 -- Operating System Services

Computer users have come to take for granted that a stable operating system will provide the basic services needed by their applications. Since they undergo a continuous process of testing and update, HPC operating environments tend to be neither stable nor complete. As HPC users have learned through sometimes bitter experience, mean-time-between-failure on parallel and clustered computing platforms may be measured in days or even hours.

The completeness of system services, while not as immediately obvious, can have just as much impact on application development as system robustness does. It is not just the application itself, but all the supporting layers of system software, that rely on operating system services. When a basic service is missing or malfunctions, the effect is annoying at best; at worst, it is debilitating. Therefore, a special subgroup of the task force, composed of system support staff from several major HPC centers, was charged with defining what system features were essential in providing a reasonable operating environment.

The group's discussions focused on eight types of system support: base services; job management; checkpointing, both of individual jobs and at the level of the entire system; resource management; file systems; system availability features; protocol support; and miscellaneous tool support such as documentation and system administration tools. All but the last are discussed in this chapter; the final item is treated separately as Other System Software and Tool Support.

The general lack of interoperability with other elements in the computing center environment was a particular source of concern to the group. No HPC machine can operate without the assistance of a myriad of other equipment, from disk and tape storage to printers, visualization engines, and desktop workstations. Yet in many cases, the HPC systems cannot interoperate smoothly with other elements. System support staff spend an inordinate amount of time debugging interface problems or constructing new interfaces to fill the gaps left by HPC vendors. This is a costly investment that must be made over and over, by each site and for each platform installed. The availability of a known collection of services, responding to known interfaces, across all HPC platforms would represent a significant step forward.

8.1 Base Services

Many of the most essential operating system functions, taken for granted on workstations or large, non-HPC machines, are not available in a consistent or robust way on HPC systems. The need is critical, in the sense that all higher-level system facilities - and ultimately, the applications - depend on the services of the base system.

Since any HPC system will have to interact with a range of devices via network connections, it is essential that there be verifiable mechanisms for allowing one entity to prove to another that it is what it claims to be. In this sense, entities include servers and other machines, as well as users. In the interests of system performance, however, such authentication should not need to be repeated every time a new resource is connected. Access control, or the ability to associate privileges with entities or sets of entities, imposes similar requirements for interoperability and performance. Ideally, credentials (i.e., authentication and privileges) should move with all authenticated entities within a particular administrative domain. (The ability to support encryption, so that hardware or low-level software interception does not violate integrity, is important but may not be required for all installations.)

This type of interoperability presupposes that authentication on the HPC system can interoperate or coexist with existing UNIX services in its broader environment, including login, passwd, rcp, rlogin, and rsh. It must also be possible to authenticate RPCs and to ensure that RPCs are neither discarded nor ignored. For clustered systems, it should be possible to formulate both intra-node and inter-node communications using the same API.

Additional discussions centered on service namespace management, which maps service names (e.g., a host or printer name) to service locations (specific network addresses and ports). Such support is essential in order for the HPC machine to interoperate reasonably with the larger system. One very desirable feature would be the provision of "smart" services capable of re-directing requests to another, equivalent server if the requested server were unavailable or subject to heavy congestion.
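The "smart" redirection described above can be sketched in a few lines. This is an illustrative model only; the class and method names (ServiceDirectory, register, mark_down, lookup) are invented for this example, not drawn from any real namespace service.

```python
# Hypothetical sketch of a service namespace manager: service names map
# to one or more locations, and lookups skip servers known to be down.

class ServiceDirectory:
    def __init__(self):
        self._locations = {}   # service name -> [(host, port), ...] in preference order
        self._down = set()     # locations currently marked unavailable

    def register(self, name, host, port):
        """Add an equivalent server for a named service."""
        self._locations.setdefault(name, []).append((host, port))

    def mark_down(self, host, port):
        """Record that a server is unavailable or congested."""
        self._down.add((host, port))

    def lookup(self, name):
        """Return the first available location, redirecting past down servers."""
        for loc in self._locations.get(name, []):
            if loc not in self._down:
                return loc
        raise LookupError(f"no available server for {name!r}")
```

For example, if the primary print server is marked down, a lookup of "printer" would transparently return the equivalent secondary server registered for the same name.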

Links to the Guidelines document:

All requirements related to authentication, security, and namespace management
All requirements related to base operating services

8.2 Support for Job Management

The role of an HPC system's job management component, with respect to a job, is much like the role of the kernel on a single node with respect to a single process. That is, the job manager must authenticate and track all jobs that enter the system, allocate resources to them, schedule them for execution, enforce site policies on resource limits and exceptions to those limits, and clean up after they terminate. (Note that there may not be an identifiable OS entity in charge of these services.) The group spent some time discussing the shortcomings of current HPC machines in supporting job management.

The most critical requirement for OS support is that jobs be elevated in status so that it is possible to manipulate a job as a single entity, just as it is currently possible to manipulate a process. In fact, a job is simply an extension of the notion of process, consisting of a collection of one or more, possibly independent, processes on one or more PEs (processing elements). It should be possible to kill a job as a single entity, thereby affecting all its processes and freeing all resources currently allocated to them. Currently, such operations must be carried out incrementally, often without any clear information about which processes are interrelated or where they are currently holding resources. Crash recovery methods should be able to detect that a portion of a job has died and clean up its other processes, in particular releasing any locks on resources. It should also be possible to query both the characteristics and the current state of jobs, and for the system administrator to modify a job's priority, privileges, and resource limits - as can be done for processes under UNIX.
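The job-as-entity idea can be made concrete with a small in-memory model. This is a sketch under stated assumptions: the Process and Job classes and their fields are invented for illustration, and real OS support would operate on kernel process tables rather than Python objects.

```python
# Illustrative model of a job as a first-class entity: killing the job
# terminates every member process and frees the resources each one holds,
# in one operation rather than incrementally.

class Process:
    def __init__(self, pid, pe, resources):
        self.pid = pid
        self.pe = pe                      # processing element the process runs on
        self.resources = set(resources)   # locks/resources currently held
        self.alive = True

class Job:
    def __init__(self, jobid, processes):
        self.jobid = jobid
        self.processes = list(processes)

    def kill(self):
        """Kill the job as a single entity; return the set of freed resources."""
        freed = set()
        for p in self.processes:
            if p.alive:
                p.alive = False
                freed |= p.resources
                p.resources.clear()
        return freed
```

The same structure supports the queries mentioned above: because the job records which PEs its processes occupy, "which nodes is my job running on?" becomes a simple traversal of the process list.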

In addition, each HPC system should be required to provide a batch system capable of queueing jobs for any subset of user-accessible PEs or nodes. For uniformity, the system's interface should conform to the POSIX 1003.2d specification. Job characteristics and state information should be maintained in some form of database so that the job's owner, or the system administrator, can pose queries such as "which nodes is my job running on?" or "why isn't my job running?" The information also should be logged, together with exit conditions, so that it can be analyzed for system tuning activities.

The system support participants were very concerned that current job scheduling software is too restrictive. Rather than providing mechanisms that can be tailored to the needs of individual sites, it enforces rather uniform policies that are not appropriate for many organizations. At the least, HPC systems should be required to support space-sharing (tiling); this makes it possible to allocate PEs as dedicated resources in support of non-overlapping jobs. This feature is critical for benchmarking as well as for specialized PE-linked resources (e.g., high-speed I/O links), which may suffer from performance degradation when they are shared by multiple jobs. An API permitting the system administrator to tailor scheduling policies to specific site requirements would be highly desirable.
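Space-sharing amounts to handing out disjoint sets of PEs. The following sketch, with invented names (TilingScheduler, allocate, release), shows the invariant that matters for benchmarking: no PE ever serves two jobs at once.

```python
# Minimal sketch of space-sharing (tiling): PEs are allocated as
# dedicated, non-overlapping sets; a job that cannot get a full tile waits.

class TilingScheduler:
    def __init__(self, num_pes):
        self.free = set(range(num_pes))
        self.allocated = {}               # jobid -> frozenset of dedicated PEs

    def allocate(self, jobid, count):
        """Dedicate `count` PEs to a job, or return None if too few are free."""
        if count > len(self.free):
            return None
        tile = frozenset(sorted(self.free)[:count])
        self.free -= tile
        self.allocated[jobid] = tile
        return tile

    def release(self, jobid):
        """Return a finished job's PEs to the free pool."""
        self.free |= self.allocated.pop(jobid)
```

A site-tailorable scheduling API, as recommended above, would essentially let administrators replace the allocation policy inside `allocate` (which PEs form a tile, which jobs may wait) without changing the dedication guarantee.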

In addition, the group recommended that time-sharing (i.e., allowing multiple processes to execute intermittently on the same PE) be supported for at least some PEs. Recognizing the difficulties that would occur if processes being time-shared were trying to communicate either among themselves or with processes time-shared on other PEs, the group noted that there does not need to be any guarantee of synchronization across nodes. Unsynchronized time-sharing would be a very acceptable model, for example, on interactive nodes where the performance degradation due to lack of synchronization would be masked by other factors.

Links to the Guidelines document:

Job management in the Baseline Development Environment
All requirements related to job management

8.3 Support for Checkpointing

The ability to checkpoint applications is essential, given the lack of robustness typical of HPC systems and the tendency of HPC applications to be long-running. Specifically, it should be possible for the OS to store the current state of a process on a particular PE and continue executing, at the request of either the job or the job scheduler. Similar capabilities should be provided at the job level, so that the information on all processes belonging to a job is saved via a single request. A standard API should be available so that the user can determine if his/her job was checkpointed.

Recognizing that the issue of re-starting execution from checkpoint data was problematic, the group recommended that libraries be provided to assist the programmer in saving specific types of checkpoint information (e.g., contents of arrays). Corresponding routines would make it possible to re-load that information in a subsequent execution, creating the effect of a limited re-start facility. It is important that such features not require that the later execution take place on exactly the same PEs or nodes as the original execution.
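The limited restart facility described above can be sketched as a pair of library routines. The function names (checkpoint_save, checkpoint_load) and the use of Python serialization are illustrative assumptions; the point is that the checkpoint file records only named data, never PE or node identity, so the later run need not use the same PEs.

```python
import pickle

# Sketch of a programmer-assisted checkpoint library: the application
# names the arrays (or other values) it wants saved, and reloads them
# by name in a subsequent execution.

def checkpoint_save(path, arrays):
    """Save a dict mapping variable names to their contents."""
    with open(path, "wb") as f:
        pickle.dump(arrays, f)

def checkpoint_load(path):
    """Reload previously checkpointed values; PE/node-independent."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A typical use would be `checkpoint_save("step42.ckpt", {"u": grid, "step": 42})` at the end of an iteration, with the restarted run calling `checkpoint_load` to repopulate its arrays before resuming.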

Links to the Guidelines document:

All requirements related to checkpointing

8.4 Support for Resource Management

The group discussed resources very generally, including "everything that a site might want to account for, limit, or take into consideration" when scheduling jobs or assessing overall system performance. CPU time, memory size, number of nodes, number of PEs per node, availability and use of particular network adapters, and disk use, are all key resources and were considered to be a "minimal set" of manageable resources in the context of the discussions.

Such resources must be allocatable, not just to individual jobs, but to individual processes within a job. This implies several ancillary features. A centralized resource database must keep track of the state (up, down, reserved, dedicated, etc.) of system resources, together with current usage policies (who can access it and for how long). It must be possible to enforce "hard limits" on resource allocation/use; that is, if a user or job exceeds the total allocation permitted, the job would be killed, suspended, held, or rejected. There must be an API - operable with all programming languages on the platform, or at least Fortran, C, and C++ - for getting and setting status values. It must be possible to aggregate the resource data for all processes of a job in order to provide accounting records.
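Hard-limit enforcement, as required above, is essentially an accounting check on every resource charge. The sketch below uses invented names (ResourceAccount, charge); the violation callback stands in for whichever site policy applies (kill, suspend, hold, or reject).

```python
# Sketch of hard-limit enforcement: usage is aggregated per account, and
# exceeding the permitted allocation triggers the site's chosen action.

class ResourceAccount:
    def __init__(self, limits, on_violation):
        self.limits = dict(limits)           # resource name -> hard limit
        self.usage = {r: 0 for r in limits}  # aggregated use so far
        self.on_violation = on_violation     # policy hook: kill/suspend/hold/reject

    def charge(self, resource, amount):
        """Record resource use; fire the enforcement action on overrun."""
        self.usage[resource] += amount
        if self.usage[resource] > self.limits[resource]:
            self.on_violation(resource)
```

The same aggregation path naturally produces the per-job accounting records mentioned above, since each process's charges flow into one account; soft limits would differ only in that the callback reports rather than terminates.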

Other recommendations would expand on these basic capabilities, providing richer information on resource use that would enable better tuning, not just of individual applications but also overall system performance. One very important addition would be the ability to assign "soft limits" on resource allocation, and then to be able to detect and report violation of those limits. Similarly, the availability of accounting information accurate to the process level would make it possible to identify resource load imbalances. A final suggestion reflected the difficulty of rescuing the system once one application has gone out of control: the ability to selectively preempt or revoke critical resources from a job.

Links to the Guidelines document:

Resource management in the Baseline Development Environment
All requirements related to resource management

8.5 File System Support

In general, file systems can be differentiated into three levels: local, distributed (but with a single namespace), and network (non-uniform namespace). The group noted that most existing HPC systems rely on NFS to provide extra-local services, but that NFS is a network (rather than a distributed) file system and is not really adequate for HPC requirements.

The most important requirement for local file systems is that they be compliant with the POSIX 1003 file system standard, including support for long file names. Other key factors are that the limit on total file system size be greater than 4 gigabytes, and that single files of more than 4 gigabytes also be permitted.

The current reliance on NFS for non-local support should be remedied. In particular, a distributed file system is needed, providing the following capabilities:

  1. a single file namespace
  2. smart caching of file data to take advantage of user access patterns
  3. a protocol that guarantees data consistency for shared writes
  4. replicated file service
  5. integration with the security/authentication service
  6. a protocol that can be tuned for different network characteristics (e.g., using large block sizes to take advantage of fast networks)
  7. a file server location hidden behind the service namespace manager
A system compatible (at the interface level) with DCE/DFS version 1.1 would meet these needs. The group noted that AFS is not satisfactory because it does not meet items 4 and 5.

Finally, mechanisms need to be put into place that make it possible to back up and restore any portion of the file systems (local as well as distributed).

Links to the Guidelines document:

File systems in the Baseline Development Environment
All requirements related to file systems

8.6 System Availability Features

In all discussions of the task force, concern was expressed about the problems of HPC system availability. Not only do these systems incorporate relatively new and untried technology, but the applications run on them tend to push to the edge of resource limitations. Degradation of service, and even complete failure, is all too frequent. It is particularly annoying when failure of one key resource (e.g., communications switch) or one application brings down the whole system, or necessitates a full reboot.

Given the somewhat experimental nature of HPC systems, the group recommended that more diagnostic support be provided. More mechanisms should be present for detecting and reporting failure of individual PEs, network paths, disks, and other critical resources. Further, communications across PEs - either messages or RPCs - must be guaranteed, at least to the extent that the sender is always notified if a communication is discarded or ignored.
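The communication guarantee asked for above is modest but important: the sender must always learn when a message is discarded, rather than having it vanish silently. A toy model, with invented names (NotifyingChannel, send), makes the contract explicit:

```python
# Toy model of guaranteed failure notification: every send returns an
# explicit outcome, and discarded messages are recorded on the sender's
# side instead of disappearing.

class NotifyingChannel:
    def __init__(self, accept):
        self.accept = accept     # predicate standing in for actual delivery
        self.discarded = []      # sender-visible record of failed sends

    def send(self, msg):
        """Deliver a message; the sender always sees success or failure."""
        if self.accept(msg):
            return True
        self.discarded.append(msg)
        return False
```

In a real system the `accept` predicate would be replaced by acknowledgments or error returns from the interconnect; the requirement is only that the failure surfaces to the sender, not that delivery always succeeds.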

Additional support is also needed for performing consistency checks of system configuration parameters. A system management tool should make it possible to do this on demand. Further, failures or downtimes are not all of equal duration or impact. At the level of individual PEs, it should be possible to perform self-tests of local memory, file systems, adapters, etc., at power-up or reboot, without affecting the remainder of the system. A secondary level of self-test would make it possible to check the consistency of global or distributed system memory, access to distributed file systems, and inter-PE communication paths. When the entire system is rebooted, it should be possible to choose between a speedy reboot and one entailing complete consistency checks for the entire system.

Link to the Guidelines document:

All requirements related to system availability

8.7 Protocol Support

One strong criticism leveled by the group was that HPC systems do not always support established standards at the operating and network system level. TCP/IP and UDP/IP are essential if the machine is to interoperate with others in its environment. The availability of the IPI-3 channel protocol is also important.

In terms of operating system services, there was extensive discussion of what level of service was needed on each PE. Obviously, it is not desirable to occupy local space or cycles supporting functions that are really extraneous to HPC, such as mail services. At the same time, many applications, particularly those built using standard component libraries, do require access to more than base-level service. The XPG4 standard, version 2, was judged to provide a sufficient level of functionality. (It was noted that POSIX 1003 does not suffice to meet the needs of HPC applications.) The group agreed that while availability of those services was essential, it was not necessary that the services actually execute on each PE.

Link to the Guidelines document:

All requirements related to system protocols





Copyright © 1997 Cherri M. Pancake