
A Survey of MPI Implementations



Performance Considerations with User-Defined Datatypes

One of the most interesting features of MPI is the ability for applications to define their own MPI datatypes. These datatypes can describe almost any C or Fortran data object, with the exception of C structures containing pointers (there is no easy way to automatically "follow" a pointer).
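
For example, a strided column of a two-dimensional C array can be described with MPI_TYPE_VECTOR and sent without any explicit copying by the application. The following is a minimal C sketch; the array a, its dimensions, and the message tag are hypothetical:

    #include <mpi.h>

    #define NROWS 100
    #define NCOLS 100

    double a[NROWS][NCOLS];            /* hypothetical application array */

    void send_column(int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column;

        /* NROWS blocks of 1 double, separated by a stride of NCOLS doubles */
        MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* the datatype lets the MPI library gather the strided elements */
        MPI_Send(&a[0][col], 1, column, dest, 0, comm);

        MPI_Type_free(&column);
    }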

User-defined datatypes are part of MPI for two basic reasons: they allow automatic data conversion in heterogeneous environments, and they permit certain performance optimizations. The PVM Pack/Unpack approach also provides automatic conversion in heterogeneous environments, but the MPI approach additionally allows optimizations such as using special hardware or a coprocessor to perform the scatter/gather, or pipelining the scatter/gather with message transfer.

Unfortunately, while this is a nice idea in principle, most implementations do not perform these optimizations, and using certain MPI datatypes can dramatically slow down communication. The two important cases are:

In both these cases, the implementation resorts to very slow copying of the send/receive buffer into/out of an internal MPI buffer, resulting in greatly reduced bandwidth -- as much as two orders of magnitude lower than for simple messages. It is then often more efficient for the user to explicitly pack and unpack the data to and from user-managed buffers (sending them as an array of MPI_BYTE or some other simple datatype) than to let MPI do it automatically.
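
As a sketch of the manual alternative, the same hypothetical column from the example above can be copied into a contiguous scratch buffer and shipped as MPI_BYTE; note that sending raw bytes forfeits MPI's automatic data conversion on heterogeneous systems:

    #include <mpi.h>
    #include <stdlib.h>

    #define NROWS 100
    #define NCOLS 100

    void send_column_packed(double a[NROWS][NCOLS], int col,
                            int dest, MPI_Comm comm)
    {
        double *buf = malloc(NROWS * sizeof(double));
        int i;

        /* gather the strided column into a user-managed buffer */
        for (i = 0; i < NROWS; i++)
            buf[i] = a[i][col];

        /* send the packed data as a block of raw bytes */
        MPI_Send(buf, NROWS * (int)sizeof(double), MPI_BYTE, dest, 0, comm);

        free(buf);
    }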

This is not the "MPI way" of doing things, and it may not always be the most efficient. For the foreseeable future, compiled user code will usually be able to pack data faster than an MPI library can (though not faster than specialized hardware). The wildcard is multithreaded MPI implementations, in which a thread running on another processor can overlap the packing with computation.

One exception to this rule is sending strided arrays on the T3E, where specialized hardware can transfer strided data faster than an MPI program can pack it. However, the hardware kicks in only if the data is described with MPI_TYPE_VECTOR, not as a contiguous array of building blocks that contain data followed by "holes."
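
To make the distinction concrete, the following sketch (MPI-1 calls, hypothetical sizes) builds the same strided column both ways. Both types are legal descriptions of the data, and sending one element of the first or NROWS elements of the second puts identical data on the wire, but only the vector form corresponds to the fast path described above:

    #include <mpi.h>

    #define NROWS 100
    #define NCOLS 100

    void build_types(MPI_Datatype *fast, MPI_Datatype *slow)
    {
        int          blocklens[2] = { 1, 1 };
        MPI_Aint     displs[2]    = { 0, (MPI_Aint)(NCOLS * sizeof(double)) };
        MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_UB };

        /* one vector type: NROWS blocks of 1 double, stride NCOLS */
        MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, fast);
        MPI_Type_commit(fast);

        /* a building block of one double whose extent is padded to NCOLS
           doubles (data followed by a "hole"); sending NROWS of these
           from the start of the column describes the same data */
        MPI_Type_struct(2, blocklens, displs, types, slow);
        MPI_Type_commit(slow);
    }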

I therefore reluctantly suggest that MPI applications implement two different methods for sending non-contiguous data: one that uses the "MPI way" with non-contiguous datatypes, and one that packs into a user-managed buffer, with the choice between them made at compile time or at runtime. Unfortunately, this is not an elegant solution, but it is the best available at this time.
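
A compile-time version of this suggestion might look like the sketch below, which merges the two earlier hypothetical routines behind one interface and selects between them with a preprocessor symbol (USE_MPI_DATATYPES is an assumed name, not part of MPI):

    #include <mpi.h>
    #include <stdlib.h>

    #define NROWS 100
    #define NCOLS 100

    void send_column(double a[NROWS][NCOLS], int col, int dest, MPI_Comm comm)
    {
    #ifdef USE_MPI_DATATYPES
        /* the "MPI way": describe the column and let the library gather it */
        MPI_Datatype column;
        MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        MPI_Send(&a[0][col], 1, column, dest, 0, comm);
        MPI_Type_free(&column);
    #else
        /* explicit packing into a user-managed contiguous buffer */
        double *buf = malloc(NROWS * sizeof(double));
        int i;
        for (i = 0; i < NROWS; i++)
            buf[i] = a[i][col];
        MPI_Send(buf, NROWS, MPI_DOUBLE, dest, 0, comm);
        free(buf);
    #endif
    }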
