| <- HREF="ch8.html" Prev | Index | Next -> |
NHSE ReviewTM: Comments · Archive · Search
In addition to the system software infrastructure outlined in the previous chapter, the group discussed several other aspects of system software related to the administration and management of HPC systems. Generally, these fell into the areas of resource administration, debugging and tuning of the overall system, and software documentation.
Fault tolerance and fault prevention were also discussed, but were eventually excluded from the document for several reasons. First, improvements in this area typically involve additional or enhanced hardware, which was beyond the scope of the task force. Second, it was not clear that the requirements across a wide range of sites were consistent enough to warrant guidelines. Finally, there did not appear to be consensus on how much an "average site" would be willing to pay for improvements in this area.
The group noted that system administration issues are often neglected by HPC vendors. Many of the traditional (i.e., single-PE) system administration tasks are still necessary on multi-PE systems. It is essential that they not require proportionately more human intervention as the number of PEs increases. Since most of the tasks are embarrassingly parallel, they ideally should require only a constant amount of time. In any case, administrative tasks should work the same way for any arbitrary collection of PEs, from a single PE to the entire complement.
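The requirement that a task work the same way on any collection of PEs can be sketched as a simple fan-out. The sketch below is illustrative only: `run_on_pes`, `check_status`, and the PE names are hypothetical, and a real tool would replace `check_status` with an actual remote operation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_pes(task, pes, max_workers=32):
    """Apply `task` to every PE name in `pes` concurrently.

    Returns a dict mapping each PE name to ("ok", result) or
    ("error", message), so one failing PE does not hide the others.
    """
    pes = list(pes)
    if not pes:
        return {}

    def guarded(pe):
        try:
            return pe, ("ok", task(pe))
        except Exception as exc:
            return pe, ("error", str(exc))

    with ThreadPoolExecutor(max_workers=min(max_workers, len(pes))) as pool:
        return dict(pool.map(guarded, pes))


def check_status(pe):
    # Hypothetical stand-in for a real per-PE operation.
    if pe == "pe3":
        raise RuntimeError("no response")
    return "up"


# The same call handles a single PE or the entire complement.
one = run_on_pes(check_status, ["pe0"])
all_pes = run_on_pes(check_status, [f"pe{i}" for i in range(8)])
```

Elapsed time stays roughly constant only while the worker count keeps pace with the number of PEs. Because the sketch caps workers at `max_workers`, a very large machine would need a hierarchical or tree-structured fan-out instead.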
The tasks that are of most importance to current HPC sites are user account administration, file system mounts, software installation, resource administration, and PE control (e.g., status and consistency checks, re-booting). Control over these should be from a single, central point so that the parallel or clustered system can be managed consistently.
Another concern was that some vendors have elected to provide a monolithic, proprietary tool for performing all system administration tasks. In practical terms, this is undesirable. Since all operations are integrated into a single tool, concurrent tasks carried out by multiple staff members with different responsibilities can become difficult or dangerous. The situation is exacerbated when the tool does not interoperate with more traditional methods of system administration such as editing system files either manually or through scripts. System administrators must tailor their procedures to installation-specific requirements and priorities, so administrative support software should not enforce a single tool access point or assume that all sites can conform to uniform procedures.
As user pressure on HPC systems has increased, system administrators have found it more and more important to tune overall system performance to match priorities at their sites. For this activity to reflect more than an educated guess, they need tools that assist in identifying and resolving system bottlenecks, capacity limitations, potential hardware problems, etc. The group noted that in order for such tools to be effective, they must be capable of recording and aggregating statistics without having significant impact on performance.
The most critical support is the ability to monitor dynamic system performance. While this obviously includes PE status and such local factors as CPU usage, memory usage, and page fault rates, that view alone is too narrow to serve as the foundation for general system tuning. In addition, it should be possible to monitor the run queues and current scheduling information on each PE, as well as system configuration information (e.g., I/O- or login-enabled nodes, disk assignments, and operational external connections).
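As a rough illustration of combining local factors with queue and configuration information, the sketch below aggregates per-PE samples into a system-wide summary. The `PESample` fields and the sample values are assumptions made for the example, not a format defined by the task force.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PESample:
    # Hypothetical per-PE snapshot; field names are illustrative.
    pe: str
    cpu_pct: float
    mem_pct: float
    page_faults_per_s: float
    run_queue_len: int
    login_enabled: bool


def summarize(samples):
    """Reduce per-PE samples to a single system-wide view."""
    samples = list(samples)
    if not samples:
        raise ValueError("no samples")
    busiest = max(samples, key=lambda s: s.run_queue_len)
    return {
        "pes": len(samples),
        "mean_cpu_pct": mean(s.cpu_pct for s in samples),
        "max_mem_pct": max(s.mem_pct for s in samples),
        "total_queued": sum(s.run_queue_len for s in samples),
        "longest_queue_pe": busiest.pe,
        "login_pes": sorted(s.pe for s in samples if s.login_enabled),
    }


samples = [
    PESample("pe0", 90.0, 40.0, 12.0, 5, True),
    PESample("pe1", 10.0, 80.0, 300.0, 0, False),
    PESample("pe2", 50.0, 55.0, 20.0, 2, False),
]
summary = summarize(samples)
```

In this made-up data, the mean CPU figure alone would suggest a moderately loaded system, while the queue lengths show that the work is concentrated on one PE.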
It is a fact of life that parallel and clustered systems are less robust than more traditional computers. Consequently, a reasonable amount of system debugging must generally be carried out on site, and this requires some additional software support. Primary among the requirements is the availability of tools that make it possible to examine an operating system image, a core image, or the kernel running on a PE. Because the PE in question is often behaving badly, it must be possible to carry out such an examination from a remote PE (or even another machine).
Finally, error logs remain one of the best sources of information on overall system behavior. Currently, many systems simply extend traditional serial support, creating an independent log on each PE. Not only does this impose too severe a burden on the system support group, it can also make it difficult or impossible to isolate problems involving multiple PEs. It is important that both system and application message logs be accessible through a centralized mechanism.
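One minimal way to picture a centralized view is to merge the independent per-PE logs into a single time-ordered stream. The sketch below assumes that each PE's log is already sorted by timestamp and that clocks are reasonably synchronized across PEs. The function name and log entries are hypothetical.

```python
import heapq


def merge_logs(per_pe_logs):
    """Merge per-PE logs into one time-ordered list.

    `per_pe_logs` maps a PE name to a list of (timestamp, message)
    pairs, each list already sorted by timestamp.
    """
    streams = (
        [(ts, pe, msg) for ts, msg in entries]
        for pe, entries in per_pe_logs.items()
    )
    # heapq.merge lazily interleaves the already-sorted streams.
    return list(heapq.merge(*streams))


logs = {
    "pe0": [(100.0, "link error on port 2"), (105.0, "job 17 aborted")],
    "pe1": [(101.5, "timeout waiting for pe0"), (106.0, "job 17 aborted")],
}
merged = merge_logs(logs)
```

Read in merged order, the entries show the timeout on `pe1` following the link error on `pe0`, a relationship that is hard to see when each log is inspected separately.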
Users and system administrators alike complained that software documentation is not readily available on HPC platforms. Difficulty of access, plus the fact that documentation is often out of sync with current software releases, is frustrating to all concerned and adds to the burden of user support.
It is no longer the case that users have access to a library of software manuals. Often, users are located across the state or across the country from the machine site. Even when they are local, gaining physical access to manuals requires at least a trip to another building. Therefore, information that is not maintained online cannot be considered accessible to users. While recognizing that many of the issues involved in electronic publishing are still fuzzy, the group strongly recommended that documentation for all system software and tools be furnished online rather than in the traditional printed manuals. This applies to user guides, reference manuals, tutorials, and installation guides, as well as manpages.
The group also pointed out that such documentation needs to be kept in non-proprietary formats. Due to the increasing tendency to develop many components of HPC applications on machines other than the ultimate target, it can no longer be assumed that a proprietary browser will be readily available to all users. Moreover, many HPC systems involve software from a variety of sources, and users cannot be expected to keep up with unique browsers for each supplier. Providing documentation in SGML, HTML, and/or PostScript formats would greatly improve accessibility for users.
Two other problems were cited as common in current documentation systems. First, some vendors restrict access to online documentation to only certain PEs on a parallel machine. For obvious reasons, it should be accessible from any PE to which a user can log in. Second, mechanisms for updating documentation need to be made clearer and more effective. For example, when documentation is neither dated properly nor mapped to software release numbers, it becomes almost impossible to ensure that the materials are kept up-to-date. Each document should indicate explicitly not only the target software to which it applies, but also which previous documents it supersedes.
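The value of dating documents and mapping them to release numbers can be illustrated with a small metadata check. Everything in the sketch below is hypothetical: the `DocRecord` fields, the document identifiers, the dates, and the release numbers are invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocRecord:
    # Hypothetical metadata a document could carry.
    doc_id: str
    title: str
    issued: date
    software: str
    release: str
    supersedes: tuple = ()


def current_docs(records):
    """Return the documents that no other document supersedes."""
    superseded = {old for r in records for old in r.supersedes}
    return [r for r in records if r.doc_id not in superseded]


def stale_docs(records, installed):
    """Return current documents whose release differs from the installed one.

    `installed` maps a software name to its installed release string.
    """
    return [
        r for r in current_docs(records)
        if installed.get(r.software) not in (None, r.release)
    ]


records = [
    DocRecord("ug-1", "Debugger User Guide", date(1996, 1, 15), "debugger", "2.0"),
    DocRecord("ug-2", "Debugger User Guide", date(1996, 9, 1), "debugger", "2.1",
              supersedes=("ug-1",)),
    DocRecord("ref-1", "Scheduler Reference", date(1995, 6, 1), "scheduler", "1.3"),
]
installed = {"debugger": "2.1", "scheduler": "1.4"}
current = [r.doc_id for r in current_docs(records)]
stale = [r.doc_id for r in stale_docs(records, installed)]
```

With explicit release and supersession fields, identifying outdated material becomes a mechanical comparison. Without them, the same check depends on someone remembering which manual replaced which.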