In our third case study,
we use the example of matrix-matrix
multiplication to illustrate issues that arise when developing data
distribution neutral libraries. In particular, we consider the
problem of developing a library to compute $C = A \cdot B$, where
$A$, $B$, and $C$ are dense matrices of size $N \times N$.
(A dense matrix is a matrix in which most of the
entries are nonzero.) This matrix-matrix multiplication involves
$O(N^3)$ operations, since for each of the $N^2$ elements $C_{ij}$
of $C$, we must compute
$$C_{ij} = \sum_{k=0}^{N-1} A_{ik} \, B_{kj}.$$
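The defining formula above can be stated directly as a sequential triple loop; this sketch (in plain Python, dependency-free, with a hypothetical name `matmul`) is the reference computation that the parallel algorithms below distribute:

```python
# Reference (sequential) matrix-matrix multiplication: for each element
# C[i][j], accumulate the inner product of row i of A with column j of B.
def matmul(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):          # N multiply-adds per element
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19.0, 22.0], [43.0, 50.0]]
```

The three nested loops over $N$ make the $O(N^3)$ operation count immediate.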
We wish to construct a library that will allow each of the arrays $A$, $B$, and $C$ to be distributed over $P$ tasks in one of three ways: blocked by row, blocked by column, or blocked by row and column. This library may be defined with a subroutine interface suitable for sequential composition in an SPMD program or with a channel interface suitable for parallel or concurrent composition. The basic algorithmic issue remains the same: Does the library need to incorporate different algorithms for different distributions of $A$, $B$, and $C$, or should incoming data structures be converted to a standard distribution before calling a single algorithm?
Figure 4.10: Matrix-matrix multiplication $A \cdot B = C$ with matrices $A$, $B$, and $C$ decomposed in one dimension. The components of $A$, $B$, and $C$ allocated to a single task are shaded black. During execution, this task requires all of matrix $A$ (shown stippled).
We start by examining algorithms for various distributions of $A$,
$B$, and $C$. We first consider a one-dimensional, columnwise
decomposition in which each task encapsulates corresponding columns
from $A$, $B$, and $C$. One parallel algorithm makes each
task responsible for all computation associated with its columns of $C$. As
shown in Figure 4.10, each task requires all of matrix $A$
in order to compute its columns of $C$. Hence, $N^2/P$ data are required
from each of the $P-1$ other tasks, giving the following per-processor
communication cost:
$$T_{\rm comm} = (P-1)\left(t_s + t_w \frac{N^2}{P}\right). \tag{4.3}$$
Note that each task performs $O(N^3/P)$ computation; if
$N \approx P$, then the algorithm will have to transfer roughly
one word of data for each multiplication and addition performed.
Hence, the algorithm can be expected to be efficient only when
$N$ is much larger than $P$ or the cost of computation is much larger
than $t_w$.
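The balance point can be sketched numerically. The formulas below are the per-task operation and data counts implied by the discussion above (counting two flops per multiply-add); the function name is illustrative:

```python
# Communication-to-computation balance for the 1-D algorithm:
# per task, ~2*N**3/P flops and ~N**2*(P-1)/P words received.
def comm_to_comp_ratio(N, P):
    flops = 2 * N**3 / P            # multiply-adds per task, counted as 2 flops
    words = N**2 * (P - 1) / P      # data received per task
    return words / flops

# When N == P, roughly one word moves per multiply-add (ratio ~0.5):
print(comm_to_comp_ratio(64, 64))
# When N >> P, communication is amortized over much more computation:
print(comm_to_comp_ratio(1024, 64))
```

This makes concrete why the 1-D algorithm is attractive only when $N \gg P$.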
Figure 4.11: Matrix-matrix multiplication $A \cdot B = C$ with matrices $A$, $B$, and $C$ decomposed in two dimensions. The components of $A$, $B$, and $C$ allocated to a single task are shaded black. During execution, this task requires corresponding rows and columns of matrices $A$ and $B$, respectively (shown stippled).
Next, we consider a two-dimensional decomposition of $A$, $B$,
and $C$. As in the one-dimensional algorithm, we assume that a task
encapsulates corresponding elements of $A$, $B$, and
$C$ and that each task is responsible for all computation associated with
its submatrix of $C$. The computation of a single element $C_{ij}$ requires
an entire row of $A$ and column of $B$,
respectively. Hence, as shown in Figure 4.11, the
computation performed within a single task requires the $A$ and
$B$ submatrices allocated to tasks in the same row and column,
respectively. This is a total of $2(\sqrt{P}-1)$ submatrices, or about
$2N^2/\sqrt{P}$ data,
considerably less than in the one-dimensional algorithm.
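The reduction in data volume is easy to quantify. This sketch encodes the per-task word counts just derived (function names are illustrative; $P$ is assumed to be a perfect square):

```python
# Per-task communication volume under the two decompositions:
# 1-D: the rest of A, ~N**2*(P-1)/P words.
# 2-D: 2*(sqrt(P)-1) submatrices of N**2/P words each.
import math

def words_1d(N, P):
    return N**2 * (P - 1) / P

def words_2d(N, P):
    q = math.isqrt(P)               # assume P is a perfect square
    return 2 * (q - 1) * N**2 / P

N, P = 1024, 64
print(words_1d(N, P))   # ~N**2
print(words_2d(N, P))   # ~2*N**2/sqrt(P)
```

For $P = 64$, the two-dimensional decomposition moves roughly a factor of $\sqrt{P}/2 = 4$ less data per task.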
Figure 4.12: Matrix-matrix multiplication algorithm based on two-dimensional decompositions. Each step involves three stages: (a) an $A$ submatrix is broadcast to other tasks in the same row; (b) local computation is performed; and (c) the $B$ submatrix is rotated upwards within each column.
To complete the second parallel algorithm, we need to design a strategy for communicating the submatrices between tasks. One approach is for each task to execute the following logic (Figure 4.12):
set $B' = B_{\rm local}$
for $j = 0$ to $\sqrt{P}-1$
    in each row $i$, the $\left((i+j) \bmod \sqrt{P}\right)$th task broadcasts
        $A' = A_{\rm local}$ to the other tasks in the row
    accumulate $C_{\rm local} = C_{\rm local} + A' \cdot B'$
    send $B'$ to upward neighbor, receive a new $B'$ from downward neighbor
endfor
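The broadcast-and-rotate scheme just described can be checked with a small serial simulation. Here each "submatrix" is shrunk to a single element, so task $(i,j)$ holds $A_{ij}$, $B_{ij}$, and accumulates $C_{ij}$; the function name is illustrative, and this is an exposition sketch, not a message-passing code:

```python
# Simulate the broadcast-and-rotate algorithm on a q x q task mesh.
# Step t: within row i, the ((i+t) mod q)th A value is "broadcast";
# every task multiplies and accumulates; then B' rotates upward.
def fox_multiply(A, B):
    q = len(A)
    C = [[0.0] * q for _ in range(q)]
    Bp = [row[:] for row in B]                 # B' buffers, one per task
    for t in range(q):
        for i in range(q):
            a = A[i][(i + t) % q]              # broadcast within row i
            for j in range(q):
                C[i][j] += a * Bp[i][j]        # local accumulate
        Bp = Bp[1:] + Bp[:1]                   # send B' to upward neighbor
    return C

print(fox_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19.0, 22.0], [43.0, 50.0]]
```

After $\sqrt{P}$ steps, each task has accumulated every term of its inner product, matching the serial result.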
Each of the $\sqrt{P}$ steps in this algorithm involves a broadcast
to $\sqrt{P}-1$ tasks (for $A'$) and a nearest-neighbor
communication (for $B'$). Both communications involve
$N^2/P$ data. Because the broadcast can be accomplished in $\log \sqrt{P}$ steps
using a tree structure, the per-processor communication cost is
$$T_{\rm comm} = \sqrt{P}\,\log\sqrt{P}\left(t_s + t_w \frac{N^2}{P}\right). \tag{4.4}$$
Notice that because every task in each row must serve as the root of a broadcast tree, the total communication structure required for this algorithm combines a hypercube (butterfly) structure within each row of the two-dimensional task mesh and a ring within each column.
Figure 4.13: Reorganizing from a one-dimensional to a two-dimensional decomposition of a square matrix when $P=16$. Shading indicates one set of four tasks that must exchange data during the reorganization.
Comparing Equations 4.3 and 4.4,
we see that the two-dimensional decomposition yields the more efficient
parallel algorithm. Does this mean that our parallel library should
convert input arrays decomposed in one dimension to a two-dimensional
decomposition before performing the matrix multiplication? To answer
this question, we need to know the cost of the reorganization. The
communication costs associated with the reorganization of a single
array are as follows; each task exchanges data with $\sqrt{P}-1$ other
tasks, with each message having size $N^2/P^{3/2}$
(Figure 4.13):
$$T_{\rm comm} = \left(\sqrt{P}-1\right)\left(t_s + t_w \frac{N^2}{P^{3/2}}\right).$$
If $A$, $B$, and $C$ are all decomposed in one dimension, we must perform three such conversions. This gives a worst-case total communication cost for reorganization and multiplication using the two-dimensional algorithm of
$$T = 3\left(\sqrt{P}-1\right)\left(t_s + t_w \frac{N^2}{P^{3/2}}\right) + \sqrt{P}\,\log\sqrt{P}\left(t_s + t_w \frac{N^2}{P}\right).$$
Comparing this expression with Equation 4.3, we see that the algorithm that reorganizes data structures to a two-dimensional decomposition before performing the multiplication will be more efficient than an algorithm that does not, when
$$3\left(\sqrt{P}-1\right)\left(t_s + t_w \frac{N^2}{P^{3/2}}\right) + \sqrt{P}\,\log\sqrt{P}\left(t_s + t_w \frac{N^2}{P}\right) < (P-1)\left(t_s + t_w \frac{N^2}{P}\right).$$
This condition holds for all except small $P$. Hence, we conclude that our parallel matrix multiply library should convert to a two-dimensional decomposition before performing computation, as follows.
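The comparison can be evaluated numerically. The machine parameters $t_s$ and $t_w$ below are illustrative assumptions (per-message startup and per-word transfer cost), not measured values; only the relative ordering of the two costs matters:

```python
# Cost models: 1-D algorithm vs. reorganize-then-2-D algorithm.
import math

t_s, t_w = 100.0, 1.0      # assumed machine costs, arbitrary time units

def cost_1d(N, P):
    return (P - 1) * (t_s + t_w * N**2 / P)

def cost_reorg_plus_2d(N, P):
    q = math.isqrt(P)      # assume P is a perfect square
    reorg = 3 * (q - 1) * (t_s + t_w * N**2 / P**1.5)
    mult = q * math.log2(q) * (t_s + t_w * N**2 / P)
    return reorg + mult

# Reorganizing wins for all but small P:
for P in (4, 16, 64, 256):
    print(P, cost_reorg_plus_2d(1024, P) < cost_1d(1024, P))
```

With these assumed parameters, reorganization loses only at $P=4$ and wins from $P=16$ onward, consistent with the claim that the condition fails only for small $P$.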
procedure matrix_multiply(A, B, C)
begin
if 1d_distributed(A) then reorg_to_2d(A)
if 1d_distributed(B) then reorg_to_2d(B)
2d_matrix_multiply(A, B, C)
if 1d_distributed(C) then reorg_to_1d(C)
end
Figure 4.14: Layout of the $A$ and $B$ matrices in the systolic matrix-matrix multiplication algorithm for a $\sqrt{P} \times \sqrt{P}$ task mesh. The arrows show the direction of data movement during execution of the systolic algorithm.
We still have not said the last word about the ideal data distribution
for matrix-matrix multiplication! An alternative algorithm
allows the broadcast operations used in the preceding algorithm to be
replaced with regular, nearest-neighbor (``systolic'') communications.
However, data must be distributed among tasks in a different fashion.
As before, we assume that $A$, $B$, and
$C$ are decomposed into $\sqrt{P} \times \sqrt{P}$ submatrices. Each task
$(i,j)$
contains submatrices $A_{ik}$, $B_{kj}$, and $C_{ij}$, where
$k = (i+j) \bmod \sqrt{P}$. This data layout is
illustrated in Figure 4.14.
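A small serial simulation checks that the skewed layout yields the correct product. As before, submatrices are shrunk to single elements, so task $(i,j)$ starts with $A_{ik}$ and $B_{kj}$, $k = (i+j) \bmod q$. The shift directions used here (A leftward, B upward) are one standard choice that makes the indices work out; they are an assumption of this sketch, and the figure's arrows may use the mirror-image convention:

```python
# Simulate a systolic (Cannon-style) multiplication with the skewed
# initial layout described above, one element per task.
def cannon_multiply(A, B):
    q = len(A)
    # Initial skew: task (i, j) holds A[i][(i+j) % q] and B[(i+j) % q][j].
    a = [[A[i][(i + j) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0.0] * q for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i][j] += a[i][j] * b[i][j]       # local accumulate
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift A
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B
    return C

print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19.0, 22.0], [43.0, 50.0]]
```

Because each task only ever exchanges data with its mesh neighbors, no broadcast structure is needed.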
Computation proceeds in $\sqrt{P}$ steps. In each step, contributions
to $C$ are accumulated in each task, after which values of
$A$ move down and values of $B$ move right. The entire computation
requires a total of $2\sqrt{P}$ messages per task, each of size
$N^2/P$, for a cost of
$$T_{\rm comm} = 2\sqrt{P}\left(t_s + t_w \frac{N^2}{P}\right).$$
Communication costs are less by a factor of about $(\log\sqrt{P})/2$ than in
Equation 4.4. Again, this benefit must be weighed
against the cost of converting matrices $A$, $B$, and
$C$ into the layout required by this algorithm. This analysis is left as
an exercise.
© Copyright 1995 by Ian Foster