NHSE Review™ 1997 Volume, First Issue

Establishing Standards for HPC System Software and Tools



Chapter 5 -- Tool Support for Debugging

The evolutionary nature of HPC applications - as one participant put it, "the only stable application is a dead application" - means that code is subjected to repeated cycles of modification or extension. Since each change introduces new possibilities for error, as well as perhaps unmasking errors that were already latent in the code, debugging activities occur intermittently throughout the lifetime of an application.

The group's debugging discussions focussed on four types of support: the availability of source-level stack traceback information, indicating the source location(s) of processes in a parallel application; requirements for a source-level interactive debugger; the ability to checkpoint the state of a parallel application, then re-load and re-start it from that point; and the consistency of where and when output to stdout/stderr appears. The first two are discussed in this chapter. The concept of application-level checkpointing was consolidated with job-level checkpointing, described under Operating System Services, while I/O considerations were grouped with other library issues in Low-Level Programming Interfaces.

A general source of concern to the group was the extreme diversity of existing HPC debuggers. Each vendor development group seems to have made a point of distinguishing its tool from those of its competitors by changing the syntax of operations and adding fine variations. Consider, for example, the concept of stepping through the program by executing one statement at a time. Each debugger not only changes the names associated with the operation, but adds a number of subtle sub-operations (as many as eight in one case). For the user who must use more than one HPC platform, or who is moving to a new platform, as much time must be spent in un-learning previous skills as in learning new ones. The group concurred that debuggers should support just a small number of simple operations, with the same operations and syntax available on all platforms; it recommended that a standard be established for such an interface.

A second source of concern was the fact that current HPC debuggers fail to take into account how code development activities are carried out. The lack of efficient support for the "edit/compile/execute/debug cycle" is a prime example. On PCs and serial workstations, integrated environments link the debugger with other tool components in a number of ways to facilitate the localization and repair of errors.

On HPC systems, however, tools operate in isolation from each other. The burden is on the user to know how to invoke each tool, to provide it with the integrating information (such as file and line number), and then to issue the commands needed to restore the tool's previous state (breakpoint settings, etc.).

In response to questions from the vendors, users made it clear that they are not insisting on a single, integrated tool. The important factor is to reduce the number of steps and amount of effort (e.g., amount of typing, potential for syntax or spelling mistakes) required to move from one stage in the problem-solving cycle to the next. This can be accomplished through short cuts - such as buttons, control keys, or hot-spots - that streamline the procedures and by designing individual tools to retain historical context so that it is not necessary to reconfigure tool settings each time the cycle is repeated.

5.1 Source-level Location Information

To understand why source-level location support is needed, consider a parallel application that has been executing for some time. Suddenly, one of the PEs halts with a segmentation violation, divide-by-zero condition, or any other fatal error. Without access to the last source-level location for the faulting process, it is quite difficult to identify where the error occurred. In that case, the programmer must guess the approximate location (on the basis of whatever output has been generated by the application), then insert print statements in an attempt to bracket the offending section of code, re-compile and build the application, and run it in an attempt to re-create the failure. For a large or a long-running application, significant amounts of time are required for this kind of compilation and re-execution.

The tool support needed to solve the problem was obvious to the users. When any kind of interrupt or failure occurs, the runtime environment should save a record of the current location of each process, in a form that will make it possible to re-construct the subprocedure and source line locations for the current routine and for each routine in the stack.
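As a rough illustration of the kind of record described above, the following Python sketch builds a signal handler that writes the interrupted routine and each of its callers as plain text, so that no debugger is needed to read it. The names `location_record`, `make_handler`, and `install`, the `rank` label, and the one-line-per-frame format are assumptions made for this example; they are not drawn from the workshop or from any vendor's runtime.

```python
import signal
import sys
import traceback


def location_record(frame, rank):
    """Return one text line per stack frame, innermost routine first."""
    lines = []
    for entry in reversed(traceback.extract_stack(frame)):
        lines.append(f"rank={rank} {entry.name} {entry.filename}:{entry.lineno}")
    return lines


def make_handler(rank, emit=sys.stderr.write):
    """Build a signal handler that emits the interrupted location and its callers."""
    def handler(signum, frame):
        emit(f"rank={rank} interrupted by signal {int(signum)}\n")
        for line in location_record(frame, rank):
            emit(line + "\n")
    return handler


def install(rank, signums=(signal.SIGTERM,)):
    """Hypothetical wiring: register the handler for each signal in signums."""
    handler = make_handler(rank)
    for signum in signums:
        signal.signal(signum, handler)
```

Calling `install(rank)` once at start-up in each process would be enough for a batch job to leave behind a readable record when it is interrupted. A Python-level handler of this kind is only suitable for ordinary interrupts; for hard faults such as segmentation violations, Python's standard `faulthandler` module is the more appropriate mechanism.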

Users also were clear that potential difficulties, such as stack corruption or the effects of code restructuring, should not deter implementation of this useful facility. If the code has been transformed, it is understood that the reported location might be only approximate, but that information is better than the alternative of no information at all. Similarly, even a corrupted stack should be able to yield the current subprocedure's location, which in itself can be valuable.

From the perspective of usefulness, the key factors are that the source location information be available to the programmer:

  1. for any interrupt or trap, not just application failure,

  2. without having to execute the application under the control of a debugger (for example, if the application is executing in batch mode), and

  3. without having to invoke a full-blown debugger in order to view the source-level information.

It was noted that although most parallel debuggers offer facilities for viewing tracebacks, they are unwieldy because they satisfy only the first, and perhaps the second, condition. Most applications need to be re-compiled (for debugging), and possibly re-executed as well (under control of the debugger), in order to identify the source location. The only tool satisfying all three conditions is the Parallel Tools Consortium's Lightweight Corefile Browser. That, however, is a small, narrowly focussed tool. What is needed is a whole collection of capabilities that are just as simple to access.

Links to the Guidelines document:

Stack traceback utilities in the Baseline Development Environment
All requirements related to stack traceback utilities

5.2 Source-level Interactive Debugger

Discussion then turned to the specific features needed in an interactive parallel debugger. In many cases, the suggestions were phrased as "don'ts" (i.e., current debugger restrictions that are onerous to users) rather than "musts." For example, users were opposed to debuggers that support only a graphical (or only a command-line) interface, since there are circumstances where each is more appropriate; also, some individuals simply prefer one approach over the other. Similarly, the tool should not restrict the user to interactive execution from the start of the program. It should be possible to perform "post-mortem" debugging of a full core dump, even if the user is only allowed to examine (and not to modify) the contents of variables, etc. It also should be possible to intermittently attach and detach the debugger from the program, so that the user can periodically check on the status of a long-running application.

A number of criticisms were directed at the clumsy or tedious nature of current debugger operations. While users do want the flexibility of being able to limit operations to subsets of the processes, the mechanism should not interfere with the simplest types of debugger use. In particular, it should be extremely easy to apply an operation to all processes - with no extra typing or mousing required. Yet many tools force the user to cope with process set mechanisms at all times (e.g., commands must explicitly specify that all processes should be affected, or must be made from within a specific window). In the worst case, some tools make it necessary to have one window or one set of commands per process. This was judged unacceptable by users; the tool must be capable of integrating the processes, at least to the extent of providing a single "console" for controlling all processes in the parallel program.
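The "all processes by default" behavior requested above can be sketched as a single console object. This is a minimal illustration only: the class name `ParallelConsole`, the `only` parameter, and the `send(rank, command)` backend are hypothetical and do not correspond to any existing debugger's interface.

```python
class ParallelConsole:
    """Single control point for every process in a parallel program."""

    def __init__(self, ranks, send):
        self.ranks = list(ranks)
        # Hypothetical backend: send(rank, command) returns that process's reply.
        self.send = send

    def run(self, command, only=None):
        """Apply command to all processes unless a subset is given in `only`."""
        if only is None:
            targets = self.ranks
        else:
            wanted = set(only)
            targets = [rank for rank in self.ranks if rank in wanted]
        return {rank: self.send(rank, command) for rank in targets}
```

With this shape, `console.run("step")` reaches every process with no extra typing, while `console.run("where", only=[1, 3])` narrows the operation only when the user asks for it.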

In general, the feeling was that debugger interfaces lag considerably behind the state of the art in serial programming environments, particularly those on personal computers. One-step point-and-click mechanisms are not exploited enough, or are not being used to access the functionality most desired by users, such as one-step breakpoint setting or viewing the contents of variables. Moreover, the tools are slow in responding to commands, often misleading users into repeating a command that is simply taking a long time to execute. Users also questioned why HPC debuggers don't follow the norm of the serial world, providing visual feedback (e.g., a clock icon, bar representing proportion of work accomplished so far, etc.) to indicate that an operation is a lengthy one. The ability to cancel lengthy operations is also important; even if the operation can't be un-done, the user should be able to halt its progress. Online help is particularly important in this context, but no current debugger provides adequate online information.
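One way the progress feedback and cancellation described above could be structured is sketched below. It assumes the lengthy operation can be split into discrete steps; the names `run_with_feedback`, `report`, and `Cancelled` are invented for this example.

```python
import threading


class Cancelled(Exception):
    """Raised when the user halts a lengthy operation."""


def run_with_feedback(steps, report, cancel_event):
    """Run each callable in steps, reporting the fraction completed after each one.

    The operation halts before the next step if cancel_event is set.
    Steps that have already completed are not undone.
    """
    total = len(steps)
    results = []
    for done, step in enumerate(steps, start=1):
        if cancel_event.is_set():
            raise Cancelled(f"halted after {done - 1} of {total} steps")
        results.append(step())
        report(done / total)
    return results
```

A user interface would pass a `report` callback that updates a progress bar and would call `cancel_event.set()` from a cancel button; the `threading.Event` lets that request come from a different thread than the one doing the work.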

Another problem that is not addressed adequately by current debuggers is the visual display of array data. Most HPC applications define extremely large arrays, for which textual displays are totally inadequate. The group was amenable to restricting the displays to 2- (or possibly 3-) dimensional slices in order to simplify the problem. Needed visual displays include simple visualizations such as bitmaps (where colors indicate value thresholds) and contour plots, as well as tabular displays of numbers; it would also be desirable for debuggers to interoperate with common rendering packages for more sophisticated visualizations.
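The slice-and-threshold display mentioned above can be sketched in a few lines. This example assumes a 3-dimensional array held as nested Python lists and uses text glyphs in place of colors; the function names, the glyph set, and the threshold values are illustrative choices, not part of the group's recommendations.

```python
from bisect import bisect_right


def slice_2d(array3d, axis, index):
    """Extract a 2-D slice from a 3-D nested list by fixing one axis."""
    if axis == 0:
        return [list(row) for row in array3d[index]]
    if axis == 1:
        return [list(plane[index]) for plane in array3d]
    if axis == 2:
        return [[row[index] for row in plane] for plane in array3d]
    raise ValueError("axis must be 0, 1, or 2")


def threshold_map(slice2d, thresholds, glyphs):
    """Render each value as a glyph chosen by how many thresholds it meets or exceeds.

    thresholds must be sorted in ascending order.
    """
    if len(glyphs) != len(thresholds) + 1:
        raise ValueError("need exactly one more glyph than thresholds")
    return ["".join(glyphs[bisect_right(thresholds, value)] for value in row)
            for row in slice2d]
```

For example, `threshold_map(slice_2d(data, 2, 0), [1, 3, 5], " .*#")` returns one string per row of the chosen slice, with denser glyphs marking larger values. A graphical tool would map the same threshold bands to colors in a bitmap.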

The group also discussed the importance of making debuggers less restrictive in terms of source language. Many current applications include modules written in C, Fortran77, and possibly Fortran90 or C++. It is not acceptable to have to learn separate debuggers for each language. Related to this is the issue of languages that may be translated into another source language, such as HPF and C++. It is important that the debugger present the user with the language he or she used, even if locations are only approximate; otherwise, it is almost impossible to identify the source line that should be modified.

Finally, at least some debugging must be possible in the presence of optimization, although it is understood that locations may be approximate and that some functionality may be unavailable. The important thing is to provide at least minimal information without forcing the user to re-write the code or re-compile it at a different level of optimization, since such changes not only consume time but also tend to mask or eliminate errors.

Links to the Guidelines document:

Interactive debugging tools in the Baseline Development Environment
All requirements related to interactive debugging tools:
Interactive debugger
Debugger interface issues
Debugger data displays

Copyright © 1996




Copyright © 1997 Cherri M. Pancake