NHSE Review™ 1997 Volume, First Issue

Establishing Standards for HPC System Software and Tools



Chapter 6 -- Tool Support for Performance Tuning

Application developers are driven to HPC platforms because they need to get results faster, solve larger problems, or increase the resolution or complexity of their results. Consequently, the ability to evaluate an application's efficiency, and to apply that analysis to improve it, is central to the HPC application development process.

The group addressed requirements for several different types of tools to assist in measuring and tuning application performance. Over the course of the discussions, these were categorized generally as: profiling tools for acquiring general timing information; event-tracing tools, which provide very detailed information on the order in which application events occurred; tools providing statistical information on how resources were used by the application; and libraries for acquiring specific metrics such as time or instruction counts. The first three are described below. Issues related to programmatic interfaces were re-assigned by the group and will be found with other library issues in Low-Level Programming Interfaces.

As with debuggers, the group was concerned about the extreme diversity of existing performance tools. Not only do tools vary significantly in terms of setup procedures, appearance, and command structure, but also in basic capabilities. While it is understandable that tool developers are limited by the performance measurement capabilities of the underlying architecture, there is still too much variability in the basic definitions that underlie data collection (for example, the meaning of "time", or what constitutes an "event"). As a result, the user who moves from one HPC platform to another must learn not just a new tool interface, but a whole new context for application tuning.

Like debuggers, performance tools are far below acceptable levels in terms of integration into the cycle of code development activities. As in debugging, the programmer may have to go through a lengthy sequence of operations (perhaps even requiring manual instrumentation of program code) in order to activate performance measurement. What's more, performance tools are remarkably poor in providing linkages to the source code. A programmer who discovers a debilitating pattern of cache use, for example, will probably have to write down hexadecimal memory addresses, manually correlate them with source-level routine names from a compiler or loader map, then guess which statements in that routine are responsible, all because the tool neglected to provide source-level information about program locations.

Another area where current tools fail to take the code development cycle into account is their all-or-none approach to gathering runtime information. While the user may be able to specify that data be collected only from particular routines, that is often the only granularity control available; typically, it is not possible to restrict monitoring to specific (compute-intensive) loops or regions of code. Similarly, the particular types of data gathered (e.g., timing values, instruction counts, cache misses, message sends) often cannot be varied over the duration of execution. This puts users in the position of recording and storing a superset of what they want, then having to filter out the unwanted information manually.

Finally, the inability of most current tools to deal with "MIMD-style" programs - that is, applications involving more than one executable - was considered an unwarranted restriction. As applications grow in size and complexity, they inevitably include more than a single executable image; one emerging trend is the desire to link entire applications or models into extremely large meta-applications that push the capabilities of HPC resources. Yet many current tools are incapable of measuring the performance of anything beyond a SPMD model. This precludes the analysis of precisely those applications which need performance tuning most.

6.1 Application Timing Information

Regardless of whether the application is to be ported to other platforms or simply maintained and enhanced over a period of time, HPC programmers are in the habit of measuring "baseline" or "benchmark" execution behavior. (Although such behavior refers to both the performance characteristics of the application and the data it produces, here we are concerned with just performance aspects.) A profile of execution timings is important in this respect. Not only does it help identify which portions of the application would benefit most from code tuning or algorithmic improvements, it also serves as a basis for assessing performance after the code has been ported or modified.

Current profiling tools provide adequate facilities for gathering the raw timing data. Techniques for monitoring parallel applications have been known for some time, as have a variety of data compression methods. The tools were criticized by the group, however, for failing to address two basic aspects of user support: control over the scope of monitoring, and techniques for analyzing and presenting results.

With few exceptions, current profilers allow the programmer to specify which procedures should be monitored, but do not offer controls over the granularity of timer sampling. Users noted that application structure is far from uniform; typically, the most meaningful timing information is at the level of compute-intensive loops or code regions. Profilers need not provide statement-level control, but they should be capable of reporting information at the level of "coarse blocks" (e.g., large loops).

Given the size and duration of many current HPC applications, there should also be a mechanism for gathering data only during specific intervals. This should be accomplished through a public calling interface that allows sampling to be activated and deactivated dynamically from within the program.
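As an illustration, a public calling interface of the kind requested might look like the following C sketch, bracketing a coarse block so that only the compute-intensive loop is sampled. The routine names profile_on and profile_off are hypothetical, not taken from any existing tool.

    /* Hypothetical profiling-control calls; names and semantics are
       illustrative, not those of any shipped tool. */
    extern void profile_on(void);    /* resume timer sampling   */
    extern void profile_off(void);   /* suspend timer sampling  */

    void solver_step(double *a, const double *b, int n)
    {
        int i;

        profile_on();                /* measure only the key loop     */
        for (i = 0; i < n; i++)
            a[i] += 1.5 * b[i];      /* compute-intensive region      */
        profile_off();               /* setup and I/O stay unmeasured */
    }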

Even more significant improvements are needed in the presentation of timing results. Current tool displays are extremely crude, reflecting technology from the early days of vector processing: simple TTY-style output, generated independently for each process involved in the application. This is totally inadequate for examining the timing characteristics of a large and complex application. Instead, the tool should be responsible for consolidating data from multiple processes and presenting it in summary form. To facilitate effective use of the information, that presentation should also provide access to the source code for the respective procedure/block via point-and-click mechanisms.

Link to the Guidelines document:

All requirements related to profiling tools

6.2 Information on Application Events

The notorious unpredictability of parallel application behavior makes it critical for the programmer to be able to study what events actually occurred, and in what order. Many current HPC platforms offer some type of event tracing tool. While they may have efficient data collection mechanisms, they suffer from problems similar to those of profiling tools. Conversely, the tools available from research groups may have better facilities for presentation and control, but their monitoring facilities are weaker since they do not have the same level of access to hardware and OS support. One solution proposed by the group is that all event tracing tools employ SDDF, or self-defining data format, so that the tracefiles can be read and used by more than one proprietary tool. This would allow the user to gather data using the most efficient, platform-specific mechanism, yet be able to analyze it with any of several higher-level tools.
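The idea behind a self-defining format is that the tracefile itself carries a description of each record's layout, so a reading tool needs no compiled-in knowledge of the writer. The C sketch below is loosely in the spirit of SDDF; the actual SDDF syntax differs in its details, and the record fields shown are illustrative.

    #include <stdio.h>

    /* Write a one-time descriptor naming the fields of a record type,
       so any conforming reader can parse the data records that follow. */
    static void write_descriptor(FILE *f)
    {
        fprintf(f, "#1 \"message send\" {\n");
        fprintf(f, "  double \"timestamp\";\n");
        fprintf(f, "  int    \"source rank\";\n");
        fprintf(f, "  int    \"dest rank\";\n");
        fprintf(f, "  int    \"bytes\";\n");
        fprintf(f, "};\n");
    }

    /* Each data record names its descriptor rather than relying on a
       fixed, tool-specific binary layout. */
    static void write_send_event(FILE *f, double t, int src, int dst,
                                 int nbytes)
    {
        fprintf(f, "\"message send\" { %.6f, %d, %d, %d };\n",
                t, src, dst, nbytes);
    }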

As with profiling tools, event support should be expanded to allow the user to restrict the regions of execution where trace records are generated. Again, this should be accomplished through a public calling interface that allows monitoring to be activated and deactivated dynamically from within the program.
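MPI's standard profiling interface already illustrates one such mechanism: the MPI_Pcontrol routine, whose conventional levels (0 disables tracing, 1 enables it) let a program bracket the regions of interest, with finer semantics left to each tool. A minimal sketch:

    #include <mpi.h>

    void exchange_halo(double *buf, int n, int left, int right)
    {
        MPI_Status status;

        MPI_Pcontrol(1);   /* by convention: enable tracing here  */
        MPI_Send(buf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &status);
        MPI_Pcontrol(0);   /* by convention: disable tracing again */
    }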

The group also stressed the importance of supporting basic event monitoring of message operations on all machines offering message-passing libraries. This does not mean that event tools must necessarily distinguish among all variations of message operations (e.g., all 130+ MPI routines). Rather, the operations can be aggregated into the general categories used by the Parallel Tools Consortium's Message Queue Manager tool [2], which supports interactive viewing of the message system.
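For example, a tool might map the many individual routines onto a handful of coarse categories, along the following lines; the category names here are illustrative, not those defined by the Message Queue Manager itself.

    /* Hypothetical aggregation of individual MPI calls into coarse
       event categories for trace records. */
    typedef enum {
        EV_SEND,        /* MPI_Send, MPI_Isend, MPI_Bsend, ... */
        EV_RECEIVE,     /* MPI_Recv, MPI_Irecv, ...            */
        EV_WAIT,        /* MPI_Wait, MPI_Test, and variants    */
        EV_COLLECTIVE   /* MPI_Bcast, MPI_Reduce, ...          */
    } event_category;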

Finally, like profilers, event tracing tools should be responsible for consolidating data from multiple processes and presenting it in summary form. To facilitate effective use of the information, that presentation should also provide access to the corresponding source code location via point-and-click mechanisms. The users emphasized that while the source location could be approximate (i.e., to within perhaps a dozen lines), it was important that the tool be capable of linking to a source code display.

Links to the Guidelines document:

Event tracing tools in the Baseline Development Environment
All requirements related to event tracing tools

6.3 Information on the Application's Resource Utilization

HPC programmers are extremely conscious of the importance of data locality to performance, but they are less certain of exactly how locality can be improved consistently within a complex HPC execution environment. Consequently, they require access to information on how hardware resources are being used during execution.

The resources of most concern to the group were: memory use (in terms of both amount and general layout), cache behavior, paging behavior, communication traffic (both messaging and I/O traffic), and operations (in general, and floating-point in particular). Tool support should be provided for acquiring cumulative statistics on how these resources are utilized and storing the data. Analysis tools should present the information when the user needs it - either on-the-fly or after execution has completed. In addition, the tools should be capable of aggregating or consolidating the information from distinct processes so that the user can compare them to analyze the entire application's resource utilization.
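A hypothetical C interface for such cumulative, per-process statistics might look like the following; the structure and routine names are illustrative only, not drawn from any existing library.

    /* Cumulative resource counts for the calling process since startup;
       a consolidating tool would sum or compare these across processes. */
    typedef struct {
        long cache_misses;      /* data-cache behavior       */
        long page_faults;       /* paging behavior           */
        long flops;             /* floating-point operations */
        long bytes_sent;        /* message traffic out       */
        long bytes_received;    /* message traffic in        */
    } resource_stats;

    /* Fill *out with current counts; returns 0 on success. */
    extern int get_resource_stats(resource_stats *out);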

Links to the Guidelines document:

Performance statistics tools in the Baseline Development Environment
All requirements related to performance statistics tools





Copyright © 1997 Cherri M. Pancake