      PROGRAM three_assignments
      REAL, DIMENSION(30,40) :: a1, a2, b1, b2, c1, c2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2, b1, b2, c1, c2
      a2 = a1
      b2 = b1
      c2 = c1
      call use_them(a2, b2, c2)
      END
The subroutine use_them after the three assignments serves to keep these translators and compilers from deducing that the arrays a2, b2, and c2 are never used. With a high level of optimization, these assignments might be entirely discarded as "dead code". Similar implied usage will be seen in all later examples.
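As a minimal illustration (not taken from the report's test codes), the same three assignments with no subsequent use could legitimately be removed by an optimizer:

      PROGRAM dead_assignments
      REAL, DIMENSION(30,40) :: a1, a2, b1, b2, c1, c2
      a2 = a1        ! the results are never read again, so a
      b2 = b1        ! sufficiently aggressive optimizer may treat
      c2 = c1        ! all three assignments as dead code
      END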
The above is translated (in part) into the following:
      DO a3 = dtx, dtx0, dtx1
        dtx3=dtx3+1
        dtx2=-1
        DO a0 = 1, 30
          dtx2=dtx2+1
          a2(a0, a3)=a1(hi+dtx2*hi0+dtx3*hi1)
          b2(a0, a3)=b1(hi2+dtx2*hi3+dtx3*hi4)
          c2(a0, a3)=c1(hi5+dtx2*hi6+dtx3*hi7)
        ENDDO
      ENDDO

This can be seen in context in the _mpf.f file generated by the translator. All three assignments are handled in one loop nest: a good example of "loop jamming".
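For readers unfamiliar with the term, here is a minimal generic sketch of loop jamming (loop fusion) in terms of the original source arrays; it is an illustration only, not output from any of the translators discussed here:

! Unjammed: one loop nest per assignment (only two of the three shown)
      DO j = 1, 40
        DO i = 1, 30
          a2(i,j) = a1(i,j)
        ENDDO
      ENDDO
      DO j = 1, 40
        DO i = 1, 30
          b2(i,j) = b1(i,j)
        ENDDO
      ENDDO
! Jammed: all assignments share a single loop nest, reducing loop
! overhead and improving locality of reference.
      DO j = 1, 40
        DO i = 1, 30
          a2(i,j) = a1(i,j)
          b2(i,j) = b1(i,j)
          c2(i,j) = c1(i,j)
        ENDDO
      ENDDO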
Note that each node has its own values for dtx, dtx0, and dtx1, computed prior to the loop by APR support routines from the details of the current distribution of each array. (See the discussion of APR's Run Time Model, below.) In fact, the form of this code would not change if the mapping were changed from BLOCK to CYCLIC.
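For example (an illustrative source change only, here switching both dimensions), the directives would become:

!HPF$ DISTRIBUTE a1(CYCLIC, CYCLIC)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2, b1, b2, c1, c2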
Each of the other scalar variables, whether assigned in the loop body or used only in the subscript expressions, serves to access the actually allocated array. APR reports that most native Fortran compilers, given sufficient optimization, can simplify these expressions or recognize their loop invariance, so that execution does not suffer from their apparent complexity; IBM's xlf is reported to have the required capability.
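To see why such subscripts are cheap, consider a hand-simplified sketch of the a2 assignment's inner loop: the transformation shown is ordinary loop-invariant hoisting and strength reduction applied to the generated subscript hi+dtx2*hi0+dtx3*hi1 (the idx variable is introduced here for illustration and is not part of xHPF's output):

! dtx3*hi1 is invariant in the inner loop, and dtx2 advances by 1 per
! iteration, so the whole subscript can be carried as one induction
! variable advanced by hi0 each time around (replacing dtx2 entirely).
      idx = hi + dtx3*hi1
      DO a0 = 1, 30
        a2(a0, a3) = a1(idx)
        idx = idx + hi0
      ENDDO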
Note also that this is the code form xHPF generates for "SHRUNK" array allocation references (each processor contains only its own portion of the mapped array) to each of the right-hand sides (RHS), a1, b1, and c1; the "FULL" array allocation handling of a2, b2, and c2 on the left-hand sides (LHS) has significantly simpler subscripts. xHPF makes this distinction because of the presence of the call use_them(a2, b2, c2) statement after the three assignments: since it was not given the source code of that subroutine to analyze for this conversion, it uses a generalization that is an xHPF extension of the HPF EXTRINSIC(HPF_LOCAL) interface. This extension permits HPF procedures to invoke FORTRAN 77 procedures without an INTERFACE block in the caller. To support the extension (in the absence of other information about use_them), xHPF has allocated the full memory span for each of the three LHS arrays and used simpler subscripting in their accesses. However, each processor still "owns" only part of each array, just as for the explicitly "SHRUNK" RHS arrays.
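For comparison, the standard HPF route would be to give the caller an explicit interface for the local routine; a rough sketch of such an interface block (illustrative only; the report's example deliberately omits it, and the descriptive directives shown merely assert the mapping already established in the caller) might look like:

      INTERFACE
        EXTRINSIC(HPF_LOCAL) SUBROUTINE use_them(a2, b2, c2)
          REAL, DIMENSION(:,:) :: a2, b2, c2
!HPF$     DISTRIBUTE a2 *(BLOCK, BLOCK)
!HPF$     DISTRIBUTE b2 *(BLOCK, BLOCK)
!HPF$     DISTRIBUTE c2 *(BLOCK, BLOCK)
        END SUBROUTINE use_them
      END INTERFACE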
This translation used xHPF's default preference for actually decomposing only one dimension: the rightmost dimension carrying either a BLOCK or CYCLIC keyword, regardless of the rank of the array or the number of its mapped dimensions. This default can be overridden by command-line arguments.
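In effect, for this example the default makes the two-dimensional directive behave as if only the rightmost dimension were decomposed; a rough source-level equivalent of that effective mapping (an illustration, not the translator's internal form) would be:

! Effective mapping under xHPF's default: only the rightmost BLOCK/CYCLIC
! dimension is actually decomposed; the first dimension stays whole.
!HPF$ DISTRIBUTE a1(*, BLOCK)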
The following code precedes the loop in the generated _mpf.f file:
      CALL dd_dstloop(10, 1, 40, 1, dtx, dtx0, dtx1, a1, 162, -11, 1, 1
     . , 30, 3, 1, 1, 10)
      dtx3=-1
      CALL dl_mem_by_dl(a1, 162, 5, hi, -11, 1, 1, 30, hi0, 3, 1, 1, 10
     . , hi1)
      CALL dl_mem_by_dl(b1, 198, 5, hi2, -11, 1, 1, 30, hi3, 3, 1, 1, 10
     . , hi4)
      CALL dl_mem_by_dl(c1, 180, 5, hi5, -11, 1, 1, 30, hi6, 3, 1, 1, 10
     . , hi7)
      CALL dd_preloop_xchng(11, 25, 'three_assignments_2.f.F77', dtx,
     . dtx0, dtx1)
      CALL dl_modify(hi7, hi6, hi5, hi4, hi3, hi2, hi1, hi0, hi)

This code anticipates APR's dynamic memory mapping model. The xHPF run time system accommodates more dynamic redistribution than the static HPF Subset directives would imply: the extra facilities would be available to users of the APR-specific !APR$ PARTITION... directive.
The dd_dstloop(...) call determines the "control distribution" (APR calls this "spreading" in its manuals) of the parallelized "DO a3" loop (the rightmost dimension of each array), based on an owner-computes analysis of the a1 array (any of the three "SHRUNK" RHS arrays would suffice; if LHS arrays were "SHRUNK", one of them would be selected for the control details). xHPF is notable for its selection of "owner computes" control determination from either RHS or LHS array references, sometimes on a statement-by-statement basis within a converted parallel control structure (this will be seen more explicitly in Chapter 6). Each of the dl_mem_by_dl(...) calls determines the current mapping for its subject array, filling in values that will be used in the loop. Note that a determination is made in this routine about required inter-processor communication for any array elements, but no actual communication is initiated yet. Then dd_preloop_xchng(...) compares the current node "locality" recorded earlier in the "control distribution" with values saved within the runtime system by each dl_mem_by_dl call (a "work list"), and actually performs any needed inter-processor communication. Finally, dl_modify(...) updates the scalar variables that might be affected by such a communication.
For this trivial code with the directives as given, of course there is no required communication.
After the loop code there is:
      CALL dd_postloop_xchng(18, 25, 'three_assignments_2.f.F77')

This anticipates that some later array-using code might require a different mapping of the data, and performs any anticipatory inter-processor communication. Again, the saved "work list" is used to make the determination, and again, for this code, no communication would be needed except at the interface to the use_them invocation: if xHPF had been provided with the source for use_them, and its mappings of the three arrays corresponded with the caller's, no actual communication would be executed. Our experience with xHPF has shown that its overhead, even in these "null communication" cases, is fairly small. One can get an actual measurement of that from the instrumented library that is part of the xHPF product.
With "O3" optimization, pghpf translates the three assignments into the following:
! forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a2((u$$b2-l$$b2+1)*(
! +i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1((u$$b-l$$b+1)*(i$i-l$$b1)+i$i1-
! +l$$b+a1$p)
      do i$i = i$$l1, i$$u1
         do i$i1 = i$$l, i$$u
            a2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1((u$$b-
     +        l$$b+1)*(i$i-l$$b1)+i$i1-l$$b+a1$p)
            b2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1((u$$b4-
     +        l$$b4+1)*(i$i-l$$b5)+i$i1-l$$b4+b1$p)
            c2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+c2$p) = c1((u$$b4-
     +        l$$b4+1)*(i$i-l$$b5)+i$i1-l$$b4+c1$p)
         enddo
      enddo

This is seen in context in the Fortran file that can be saved during a pghpf translation. Loop jamming has been implemented in this case. (With no optimization in the compiler invocation, each assignment would be translated into its own loop.)
The commented-out "forall" line is an abbreviated indication of the pghpf compiler's parallelization accomplishments. (Only the first assignment of the three is indicated in the comment; it is not intended to be a correct HPF FORALL statement.) Here both dimensions of the distribution are implemented. Also, since this compiler determines that there can be no inter-processor communication, it generates no extra calls. PGI's pghpf has no facility corresponding to xHPF's "FULL" allocations, so all distributed arrays are locally allocated in per-node-portion form in the generated code.
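For reference, a correct FORALL for the first of the three assignments, written in terms of the original program's arrays rather than pghpf's internal names, would simply be:

      INTEGER :: i, j
      FORALL (i = 1:30, j = 1:40) a2(i, j) = a1(i, j)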
XL HPF translates the three assignment statements as:
C 1585-501 Original Source Line 6
      do i_5=iown_l_15,MIN0(iown_u_16,40),1
C 1585-501 Original Source Line 6
        do i_6=iown_l_17,MIN0(iown_u_18,30),1
          a2_28(i_6,i_5) = a1_27(i_6,i_5)
          b2_30(i_6,i_5) = b1_29(i_6,i_5)
          c2_32(i_6,i_5) = c1_31(i_6,i_5)
        end do
      end do

The above is extracted from the pseudo-Fortran listing. Clearly there is good loop jamming here.
This is the form of the code under the compiler's "-qhot" (High Order loop Transformations) optimization, and it is the portion of the listing elicited by the "-qreport=hotlist" command-line option. (Examination of earlier portions of that listing shows all the invocation options and the more line-by-line HPF Parallelization Report elicited by "hpflist".)
This compiler generated a two-dimensional parallel spreading of the assignments. The arrays are allocated locally with each processor holding only its portion. No communication is needed and so none is generated.
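The iown_l/iown_u values in the listing are simply per-processor BLOCK bounds; a small self-contained sketch of the usual formula (illustrative values and names, not IBM's runtime code) is:

      PROGRAM block_bounds
        INTEGER :: n, np, myp, blk, lb, ub
        n   = 40                     ! extent of the distributed dimension
        np  = 4                      ! illustrative number of processors
        myp = 1                      ! illustrative processor index, 0-based
        blk = (n + np - 1) / np      ! block size, rounded up
        lb  = myp*blk + 1            ! first owned element
        ub  = MIN(n, (myp + 1)*blk)  ! last owned element; cf. MIN0(iown_u_16,40)
        PRINT *, 'processor', myp, 'owns elements', lb, 'to', ub
      END PROGRAM block_bounds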