Fine-Grain MPI for Extreme-Scale Programming
Leader: Alan Wagner <wagner@cs.ubc.ca>
Estimated time: 2 hours
Background
Scalable computing is needed to meet the challenge of computing on
ever-increasing amounts of data. Multicore and many-core processors add
more processing cores but still cannot keep pace with the demand for more
computation. There is a need for technologies that scale, not only to
multiple cores within one chip, but also across multiple chips in one
machine and across machines in a data center.
Process-oriented programming is the design of programs as a collection
of processes that communicate using message-passing. Process-oriented
programs execute concurrently either by (a) logical concurrency: sharing
one processor core and taking turns executing, or (b) physical
concurrency: executing in parallel on separate cores in one or more
machines. Extreme-scale programming is the term we use for the design
of programs that use thousands or millions of such processes.
Discussion
Why extreme-scale programs? You cannot take advantage of thousands or
millions of cores if you cannot express thousands or millions of
independently executable processes. Message-passing is the only
communication mechanism that scales from one machine, to many machines
inside a data center, to networks including the Internet.
The only systems in common use today that may possibly scale to extreme
size are SPMD programs, where multiple copies of the same program are
started in parallel, and Hadoop-like processing, where the framework is fixed.
These two approaches represent only a small subset of the extreme-scale
programs that are possible, and of the more general problems that could
be tackled at extreme scale. The ability to design extreme-scale
programs is essential to meet the challenge of data-intensive computing.
As the volume and velocity of data increase, so too does the cost of
computing on that data, and there is a corresponding need for programs
that scale to use more and more processors.
How do we execute extreme-scale programs with thousands or millions of
processes? It is challenging to design programs that execute on more
than a few processors, and challenging to muster and provision that many
resources.
Fine-Grain MPI
(FG-MPI) is a modified
version of MPI created to support extreme-scale programming. FG-MPI is
derived from MPICH2, a widely used implementation of MPI, and makes it
possible to express massive amounts of concurrency by combining
multiple MPI processes inside a single OS-process with multiple such
OS-processes running on multiple cores across multiple machines. This
adds an extra degree of freedom that makes it possible to adjust the
execution of a program to use some mix of logical and physical
concurrency. The ability to add logical concurrency decouples the
number of processes from the hardware, making it possible to execute
many more processes than there are physical cores.
For example, we recently demonstrated FG-MPI executing over 100 million
MPI processes (an exascale number of processes) on 6,500 processing
cores in a compute cluster. In this way FG-MPI gives an extra scaling
factor: in the small, programmers can develop programs using thousands
of processes on a workstation, and then take those same programs and
scale them to what we believe can be billions of processes on a large
cluster. Today, because MPI binds each MPI process to an OS-process, one
cannot design an extreme-scale program without a correspondingly extreme
number of machines. The concurrency enabled by FG-MPI provides in excess
of a thousand-fold magnifier in scaling the number of processes. An
FG-MPI program with 1000 processes can run inside one OS-process on a
single core, and the same program can run in parallel with each of the
1000 processes on its own core.
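As a concrete illustration, the minimal program below is written in
standard MPI; FG-MPI runs such programs with the number of co-located
processes per OS-process chosen at launch time rather than in the code.

    #include <stdio.h>
    #include <mpi.h>

    /* A minimal MPI program: each process reports its rank.  Under
       FG-MPI, many of these processes can share one OS-process. */
    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from MPI process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Assuming the -nfg option used by recent FG-MPI releases to set the
number of co-located processes per OS-process, ``mpiexec -nfg 250 -n 4
./hello'' launches 1000 MPI processes as four OS-processes of 250
co-located processes each, while ``mpiexec -nfg 1 -n 1000 ./hello''
gives each process its own OS-process; the program text is identical in
both cases.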
Extreme-scale programming enables a new approach to programming large
systems. Once programs can have millions or billions of processes, one
can begin to imagine programs where each process models the activity of
real entities: whether these be traders in a stock exchange, neurons in
neural-computing applications, individuals in a social network, or
devices in a sensor net. We can begin to model the real world, explore
the properties that emerge, synchronize these models with real data, or
explore new domains.
One cannot develop these types of programs without the ability
to create enough processes to test the program with a realistic number
of entities. FG-MPI makes these types of programs possible.
As the name fine-grain implies, a motivation for the development of
FG-MPI was not simply the ability to have lots of processes; it was also
the ability to express fine-grain concurrency. The usual coarse-grain
approach, where one identifies relatively large chunks of computation
that can be done in parallel, limits the ability to scale, since the
parallelism cannot exceed the number of large chunks of computation;
scaling further can require a complete redesign of the program.
In FG-MPI, since there is more concurrency at the fine grain (functions)
than at the coarse grain (programs), the programmer can design the
program from the start with a large number of fine-grain processes.
Using FG-MPI, these processes can then be flexibly combined to execute
inside one OS-process (logically concurrently) to coarsen the
parallelism of the application. The main difference is that rather than
starting with coarse-grain parallelism and having to redesign the
program as we scale, we start with many smaller processes and then
combine them as necessary as we scale or port to different machines.
This allows the programmer to expose a massive amount of concurrency in
a way that can be flexibly mapped onto many types and sizes of
clusters.
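As a sketch of this fine-grain style, the ring program below expresses
one small process per MPI rank using only standard MPI calls; how many
of these ranks share an OS-process is a launch-time decision, not a
property of the code.

    #include <stdio.h>
    #include <mpi.h>

    /* Each MPI process is a small, fine-grain unit of work: it passes a
       token once around a ring.  Run with at least two processes; with
       FG-MPI the ring can contain far more processes than cores. */
    int main(int argc, char *argv[])
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) {
            MPI_Send(&token, 1, MPI_INT, 1 % size, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Token passed through %d other processes\n", token);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            token++;
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

The same executable can be coarsened or spread out at launch time, for
example by co-locating all of its processes in one OS-process on a
workstation during development and then mapping them across the cores of
a cluster for production runs.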
The approach of combining smaller processes into a single OS-process is
not a new idea. What is new, and what makes the approach work, is the
use of non-preemptive threads to implement processes, integrated
directly into the middleware. Previous approaches using other types of
threads, or layering light-weight threads on top of the middleware, do
not achieve the scalability and performance possible with FG-MPI. FG-MPI
supports thousands of logically concurrent processes, and the support
for multiple processes per OS-process has been carefully integrated into
the MPICH2 middleware to reduce its overhead. For example, we have shown
that even for the well-known NAS benchmarks we can improve performance
over existing MPI by adding some logical concurrency. Adjusting the
logical concurrency (the size of each process) makes it possible to
better fit the data accessed by a process into the cache, and smaller
processes lead to more frequent, smaller messages and more fluidity. In
most of the NAS benchmarks these improvements outweighed the cost of the
added concurrency. In conclusion, FG-MPI not only enables new types of
programs but can also improve the performance of existing MPI programs.
FG-MPI provides a unique opportunity for developing extreme-scale
programs. There is the opportunity to make it available for widespread
use and to create an open-source community around the technology.
Because FG-MPI is backwards compatible with existing MPI programs, there
is already a large community that could adopt this technology. As well,
the close association of the FG-MPI group with the MPICH2 group at
Argonne provides a clear path for incorporating FG-MPI into MPICH2.
MPICH2 is a very successful implementation of MPI that is widely used as
the basis of many commercially available MPIs, including Intel's MPI.
Compose-Map-Configure
The workshop will also consider another aspect of the engineering.
Parallel programming languages available today focus on the
programmability offered by the language, in order to transition
developers to thinking in parallel, but neglect the deployment and
placement of the resulting parallel tasks. Two major factors in
performance are communication overhead and idle-processing overhead.
With SPMD programs, the work-stealing approach employed by most
languages is a natural way to maximize processing efficiency. With MPMD
programs, however, the amount of time each process spends on computation
varies more, and processes should not all be treated equally. These
processes may be seen as services with varying amounts of up-time, and
they benefit from being placed statically to reduce unnecessary process
migration. Furthermore, placing a process based only on its
computational requirements neglects the communication overhead, which
plays a large role in the performance of many applications.
With FG-MPI, we can exploit locality by statically placing processes on
the available computational resources, as well as by having processes
run concurrently, to reduce communication and idle-processing overheads.
However, with this new flexibility comes a need to specify and tweak
these options quickly. We propose a Compose-Map-Configure (CMC) tool
that employs a four-stage approach to software specification and creates
a separation of concerns from design to deployment. This allows
decisions about architecture interactions, software size, resource
capacity, and function to be made separately. The main ideas include the
encapsulation of parallel components using hierarchical process
compositions, together with corresponding support for channel
communication. There are opportunities to simplify the composition of
separately developed services and applications, as well as to integrate
with optimization tools that explore and adapt the computation to the
available resources.
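As a small illustration of what must currently be specified by hand, an
MPMD deployment in MPICH2 is written out explicitly on the mpiexec
command line, with something like ``mpiexec -f hosts -n 2 ./frontend :
-n 30 ./worker'' and host placement in a separate host file; CMC aims to
generate this kind of deployment and placement specification from a
higher-level composition, so that the same composed application can be
re-mapped to different resources without hand-editing launch commands.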
An example of the CMC process will be presented in the workshop, showing
how we go from processes plus specification to execution. Feedback on
the ideas behind the CMC tool will be sought and will be very welcome.
Action
The purpose of the workshop is to give a practical, hands-on
introduction to FG-MPI: how to create FG-MPI programs and run them using
``mpiexec''. I will discuss the added flexibility in executing programs,
and its limitations. I will also discuss applications and tools we have
started to develop, and potential extensions.
Fine-Grain MPI
(FG-MPI) extends the MPICH2 runtime to support the execution of
multiple MPI processes inside an OS-process. FG-MPI supports fine-grain,
function-level, concurrent execution of MPI processes by minimizing the
messaging and scheduling overheads of processes co-located inside the
same OS-process. FG-MPI makes it possible to have thousands of MPI
processes executing on one machine, or millions of processes across
multiple machines.
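In practice the workflow is the familiar MPICH2 one: compile with the
mpicc wrapper and launch with mpiexec, choosing at launch time how many
co-located MPI processes each OS-process hosts (the -nfg option shown in
the earlier example), for instance:

    mpicc -o hello hello.c
    mpiexec -nfg 500 -n 2 ./hello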
For the practical part of this workshop, it will help to have downloaded
the latest FG-MPI release, together with the code examples on the
download page. These are available from the "Downloads" tab on the
home page.
However, the work can still be followed with just pencil and paper.