      PROGRAM two_triples
      REAL, DIMENSION(30,40) :: a1, a2, b1, b2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2, b1, b2
      CALL init( a1, b1)
      a2(1:29, 1:39) = a1(2:30, 2:40)
      b2(2:30, 2:40) = b1(1:29, 1:39)
      CALL use( a2, b2)
      END
For any processor arrangement with more than one processor, elements of the a1 and b1 arrays at the "edges" of the local blocks will need to be communicated to an "adjacent" processor. The two subroutine calls (to init and use) have been inserted so that even an extremely clever optimizer in any compiler could not immediately deduce that a1 and b1 are never initialized, or that a2 and b2 are never used after the assignments.
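To make that edge traffic concrete, here is a minimal sketch (ours, not output from any of the compilers below) of the ownership arithmetic under an assumed p1 x p2 processor grid; all of the names (p1, p2, myrow, mycol, bs1, and so on) are illustrative:

!     Sketch only: block bounds for the (BLOCK, BLOCK) distribution of
!     a 30 x 40 array on an assumed p1 x p2 grid of processors, with
!     this processor at zero-based coordinates (myrow, mycol).
      INTEGER p1, p2, myrow, mycol
      INTEGER bs1, bs2, lo1, hi1, lo2, hi2
      bs1 = (30 + p1 - 1) / p1          ! block extent, dimension 1
      bs2 = (40 + p2 - 1) / p2          ! block extent, dimension 2
      lo1 = 1 + myrow*bs1               ! my owned rectangle of a1 ...
      hi1 = MIN(30, lo1 + bs1 - 1)
      lo2 = 1 + mycol*bs2
      hi2 = MIN(40, lo2 + bs2 - 1)      ! ... is (lo1:hi1, lo2:hi2)
!     To evaluate a1(2:30, 2:40) for my owned elements of a2, I also
!     need row hi1+1 and column hi2+1 of a1 (where they exist): one row
!     from the neighbor at (myrow+1, mycol) and one column from the
!     neighbor at (myrow, mycol+1).  The b1/b2 shift is symmetric, in
!     the opposite direction.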
Before examining the parallel-code translation produced by xHPF, it is useful to look at what the translator generated internally while analyzing the Fortran 90 source. xHPF makes an intermediate translation from Fortran 90 to semantically equivalent (serial) FORTRAN 77 source code prior to doing its final parallelism analysis and translation for parallel execution. (This partially stems from the product's orientation as a translator of FORTRAN 77-based HPF codes and its automatic-parallelization focus on that language.) How that translation is performed can ultimately affect how the communication code is organized. In the following command-line invocation of xHPF:
xhpf -keep -verbose=20 -ompf=two_triples_f90_mpf.f \
     -BenignUnknown two_triples.f
the -keep option causes the intermediate FORTRAN 77 file to be preserved:
      PROGRAM two_triples
      REAL a1(30, 40), a2(30, 40), b1(30, 40), b2(30, 40)
CHPF$ DISTRIBUTE a1(BLOCK, BLOCK)
CHPF$ ALIGN (:, :) WITH a1(:, :) :: a2, b1, b2
      INTEGER apr_opt_int_arg
      CHARACTER apr_opt_char_arg
      COMMON /apr_res1/apr_opt_int_arg/apr_res2/apr_opt_char_arg
      INTEGER a, a0, a3, a4, a5
      CALL init(a1, b1)
      DO a3 = 1, 39
         DO a0 = 1, 29
            a2(a0, a3) = a1(1 + a0, 1 + a3)
            b2(1 + a0, 1 + a3) = b1(a0, a3)
         ENDDO
      ENDDO
      CALL use(a2, b2)
      END
Notice that, apart from the extra declarations anticipating aspects of the xHPF runtime system, the translation from Fortran 90 to FORTRAN 77 is optimal: the translator has recognized that the subscript triples in both statements are mutually conformal, that there is no "overlap" in elements between right-hand sides and left-hand sides, and that there is none between any fetches and stores across the two assignments (there are no true, loop-carried, or anti-dependences).
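For contrast, had either statement shifted an array onto itself, that freedom would disappear. The following assignment (not part of two_triples) carries exactly the hazard the analysis rules out here:

      a1(2:30, 2:40) = a1(1:29, 1:39)
!     Fortran 90 semantics require the right-hand side to be fetched
!     as if complete before any store.  A naive elementwise rendering,
!         DO a3 = 2, 40
!            DO a0 = 2, 30
!               a1(a0, a3) = a1(a0 - 1, a3 - 1)
!            ENDDO
!         ENDDO
!     would read elements that earlier iterations had already
!     overwritten, so a compiler must introduce a temporary or invert
!     the loop order.  No such hazard exists among the four distinct,
!     mutually aligned arrays of two_triples.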
The communication code generated for these two assignments is:
      CALL dd_dstloop(10, 1, 39, 1, dtx, dtx0, dtx1, a2, 162, -11, 1, 1
     .   , 29, 3, 1, 1, 10)
      CALL dd_use_by_dl(a1, 126, -11, 2, 1, 29, 3, 2, 1, 10)
      CALL dd_set_by_dl(b2, 144, -11, 2, 1, 29, 3, 2, 1, 10)
      CALL dd_preloop_xchng(13, 17, 'two_triples.f.F77', dtx, dtx0, dtx1
     .   )
      DO a3 = dtx, dtx0, dtx1
         DO a0 = 1, 29
            a2(a0, a3) = a1(1 + a0, 1 + a3)
            b2(1 + a0, 1 + a3) = b1(a0, a3)
         ENDDO
      ENDDO
      CALL dd_postloop_xchng(19, 17, 'two_triples.f.F77')
This can be seen in full context in the entire generated code. The parallelism has defaulted to a single dimension, the rightmost of each array, as in our other examples using xHPF.
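The dd_... entry points are opaque APR runtime routines, but a plausible reading is that dd_dstloop partitions the a3 iteration space 1:39 among the processors and returns each processor's share in dtx, dtx0, and dtx1. Under a simple block partition, an assumed equivalent would be:

!     Illustrative sketch only; nproc and mypid (zero-based) are
!     assumed names, and the real dd_dstloop may partition differently.
      INTEGER nproc, mypid, chunk, dtx, dtx0, dtx1
      chunk = (39 + nproc - 1) / nproc
      dtx   = 1 + mypid*chunk            ! my first a3 iteration
      dtx0  = MIN(39, dtx + chunk - 1)   ! my last a3 iteration
      dtx1  = 1                          ! unit stride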
Examination of the communication pattern of the generated code (e.g., with a software communications-trace analyzer on the SP) shows that the initial and final communication patterns, the ...preloop... and ...postloop... exchanges, are strongly influenced by the adjacent calls to the init and use subroutines. xHPF has no interface information for those two subroutines, so it must assume conservatively that each call may read or write any element of the mapped arrays passed to it.
Despite any excess communication before or after the two assignments, one can see from the generated code that there will be no extra communication within either the outer "do a3" loop or the inner "do a0" loop. This exhibits xHPF's extended implementation of "owner computes", which is not necessarily tied to the left-hand-side array references of assignments. Also, the APR runtime system is prepared to move only those values of a1 (before the loop nest) or b2 (after the loop nest) that are not local to each processor's elements of a2 (equivalently, b1) in the subscript range (1:29, 1:39).
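For comparison, the classical rule guards every store with ownership of its left-hand-side element. A schematic rendering (the i_own predicate is hypothetical, not an APR routine) would be:

!     Strict "owner computes", sketched: every store guarded by a
!     hypothetical ownership test.  xHPF instead chooses a single
!     partitioning of the a3 loop for the whole fused nest, legal
!     here because a2, b1, and b2 are all aligned with a1.
      LOGICAL i_own
      DO a3 = 1, 39
         DO a0 = 1, 29
            IF (i_own(a2, a0, a3))
     .         a2(a0, a3) = a1(1 + a0, 1 + a3)
            IF (i_own(b2, 1 + a0, 1 + a3))
     .         b2(1 + a0, 1 + a3) = b1(a0, a3)
         ENDDO
      ENDDO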
The pghpf translation does not take advantage of the mutual conformability of the four array sections, nor of the lack of dependences, when rendering them to per-processor code, even when the translator is invoked at the -O3 optimization level. It generates separate communication and loop control for each assignment statement:
      cp$$com = pghpf_olap_shift(a1(a1$p),a1$d2(a1$dp2),1,1,1,1)
      xfer$$com = pghpf_comm_start(cp$$com,a1(a1$p),a1$d2(a1$dp2),a1(
     +a1$p),a1$d2(a1$dp2))
      call pghpf_comm_finish(xfer$$com)
      call pghpf_localize_bounds(b2$d1(b2$dp1),1,1,29,1,i$$l,i$$u)
      call pghpf_localize_bounds(b2$d1(b2$dp1),2,1,39,1,i$$l1,i$$u1)
! forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a2((u$$b2-l$$b2+1)*(
! +i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1(i$i1+1+(u$$b-l$$b+1)*(i$i+1-l$$b1
! +)-l$$b+a1$p)
      do i$i = i$$l1, i$$u1
         do i$i1 = i$$l, i$$u
            a2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1(i$i1+1+
     +(u$$b-l$$b+1)*(i$i+1-l$$b1)-l$$b+a1$p)
         enddo
      enddo
      xfer$$com1 = pghpf_comm_start(cp$$com,b1(b1$p),b2$d1(b2$dp1),b1(
     +b1$p),b2$d1(b2$dp1))
      call pghpf_comm_finish(xfer$$com1)
      call pghpf_localize_bounds(b2$d1(b2$dp1),1,2,30,1,i$$l2,i$$u2)
      call pghpf_localize_bounds(b2$d1(b2$dp1),2,2,40,1,i$$l3,i$$u3)
! forall (i$i=i$$l3:i$$u3:1, i$i1=i$$l2:i$$u2:1) b2((u$$b2-l$$b2+1)*
! +(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1(i$i1-1+(u$$b2-l$$b2+1)*(i$i-1-
! +l$$b3)-l$$b2+b1$p)
      do i$i = i$$l3, i$$u3
         do i$i1 = i$$l2, i$$u2
            b2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1(i$i1-1+
     +(u$$b2-l$$b2+1)*(i$i-1-l$$b3)-l$$b2+b1$p)
         enddo
      enddo
      call pghpf_comm_free(1,cp$$com)
Apart from some initial allocation of the per-processor dynamically-determined portion of each array, there isn't much more executable code in the complete generated source.
Parallel execution is indicated by comments containing FORALL, and is along both array dimensions. For each assignment statement, all communication is accomplished prior to the per-processor double loop, with the invocations of pghpf_olap_shift(...), pghpf_comm_start(...), and pghpf_comm_finish(...), using memory space managed with bookkeeping in PGI's runtime system via the INTEGER variable cp$$com. Note the freeing of that space after both assignments at the call to subroutine pghpf_comm_free(...).
Note also that pghpf assumes that any user routine for which it has not been supplied an INTERFACE block is an HPF routine. In the generated source, the interface to each of init and use has been expanded from two to four arguments: the extra arguments give the PGI procedure-invocation runtime system access to details of the mapping of each original array argument. Any required remappings would be performed within each subroutine. This code allocates only that part of each mapped array that is local to each processor.
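Those expanded calls are not reproduced above, but judging from the descriptor notation visible in the loop code, they presumably take a form along these lines (the argument ordering is our assumption, not verbatim pghpf output):

!     Assumed shape of the widened interface: each original array
!     argument travels with a distribution descriptor, so the callee
!     can consult, or remap, the incoming mapping.  (Note from the
!     loop code above that the aligned arrays evidently share a
!     descriptor: b1 is indexed through b2$d1(b2$dp1).)
      call init(a1(a1$p), b1(b1$p), a1$d2(a1$dp2), b2$d1(b2$dp1))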
IBM XL HPF, for all that it has a well-integrated optimizer and has elsewhere been seen by CTC to generate impressively clean code, handles the communication for these two assignments no better than pghpf: each assignment is translated into its own separate communication actions, followed by its own nested do-loop pair. Since the generated-code pseudo-Fortran listing is at such a low semantic level (although the MPI calls cannot be seen directly), a full extract of that listing for those two statements spans 128 lines. The full listing (approximately 20K bytes) can be viewed with that region highlighted. The listing contains both the earlier, pre-"High Order Transform", annotated pseudo-source "HPF Parallelization Report" and the later "Loop Transformation Report"; in fact, except for one extra declaration in the later portion, those listing portions are identical.
Each statement's essential computation is preceded by two sequences with general structure:
C 1585-501 Original Source Line ...
      if ( ...<processor to receive values>... ) then
         ...<initialize communication-control data structures>...
         call _xlhpf_nbreceive_section(<right-hand-side-array>, ... )
      end if
C 1585-501 Original Source Line ...
      if ( ...<processor to send values>... ) then
         ...<initialize communication-control data structures>...
         call _xlhpf_send_section(<right-hand-side-array>, ... )
      end if
      call _xlhpf_waitforall(1)

Each of the two receive/send/wait sequences deals with communication along one of the two mapped dimensions of the array.
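Although the MPI traffic is hidden behind the _xlhpf_... wrappers, the structure is the standard nonblocking-receive / send / wait idiom. In plain MPI Fortran it would look roughly as follows; the buffer sizes, the neighbor ranks src and dest, and the guard logic are all illustrative:

      include 'mpif.h'
      INTEGER req(1), statuses(MPI_STATUS_SIZE, 1), ierr, nreq
      INTEGER src, dest
      REAL recvbuf(30), sendbuf(30)
      nreq = 0
      IF (src .GE. 0) THEN
!        I own elements needing values from a neighbor: post a
!        nonblocking receive so the matching send cannot deadlock.
         nreq = nreq + 1
         CALL MPI_IRECV(recvbuf, 30, MPI_REAL, src, 0,
     .                  MPI_COMM_WORLD, req(nreq), ierr)
      ENDIF
      IF (dest .GE. 0) THEN
!        A neighbor needs my edge values: send them.
         CALL MPI_SEND(sendbuf, 30, MPI_REAL, dest, 0,
     .                 MPI_COMM_WORLD, ierr)
      ENDIF
!     Block until the posted receive (if any) has completed.
      CALL MPI_WAITALL(nreq, req, statuses, ierr)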
Finally, the on-processor essential computations (here, the assignments) are performed by do-loop nests. The assignment a2(1:29, 1:39) = a1(2:30, 2:40) generates:
C 1585-501 Original Source Line 9
      do i_5=iown_l_15,MIN0(iown_u_16,39),1
C 1585-501 Original Source Line 9
         do i_6=iown_l_17,MIN0(iown_u_18,29),1
            a2_46(i_6,i_5) = a1_45(i_6 + 1,i_5 + 1)
         end do
      end do

and the assignment
b2(2:30, 2:40) = b1(1:29, 1:39) generates:
C 1585-501 Original Source Line 10
      do i_5=MAX0(iown_l_15 - 1,1),MIN0(iown_u_16 - 1,39),1
C 1585-501 Original Source Line 10
         do i_6=MAX0(iown_l_17 - 1,1),MIN0(iown_u_18 - 1,29),1
            b2_48(i_6 + 1,i_5 + 1) = b1_47(i_6,i_5)
         end do
      end do

Loop bounds such as iown_l_15 are initialized in code early in the pseudo-Fortran, for example:
      iown_l_15 = 1 + ((40 + PGB_10(2)) - 1) / PGB_10(2) * PID_11(2)

This initialization occurs proximate to the allocation of the mapped array portions for each processor, and prior to the communication code shown above.
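Under a plausible reading of the names (PGB_10(2) as the processor-grid extent along the second dimension, PID_11(2) as this processor's zero-based coordinate there), this is the usual block lower-bound formula. For example, with four processors along that dimension:

!     Assumed: PGB_10(2) = 4 and PID_11(2) = 0, 1, 2, or 3.
!     (40 + 4 - 1) / 4 = 10 in integer division: the block size.
!        PID = 0:  iown_l_15 = 1 + 10*0 =  1
!        PID = 1:  iown_l_15 = 1 + 10*1 = 11
!        PID = 2:  iown_l_15 = 1 + 10*2 = 21
!        PID = 3:  iown_l_15 = 1 + 10*3 = 31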
Given the conditional subroutine calls for communication, and the different expressions used for each loop-nest's bounds, it is highly unlikely that any later optimization phase of the XL HPF compiler could combine the two array assignments in the way that the xHPF compiler has.
The init and use subroutine calls generate no extra code in the calling program, and appear to take arguments that are pointers to run-time descriptor structures for each of their array arguments (see the POINTER declaration and commentary in the pseudo-Fortran for each of a1, a2, b1, and b2).