      PROGRAM three_assignments
      REAL, DIMENSION(30,40) :: a1, a2, b1, b2, c1, c2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2, b1, b2, c1, c2
      a2 = a1
      b2 = b1
      c2 = c1
      call use_them(a2, b2, c2)
      END
The subroutine use_them after the three assignments serves to keep these translators and compilers from deducing that the arrays a2, b2, and c2 are never used. With a high level of optimization, these assignments might be entirely discarded as "dead code". Similar implied usage will be seen in all later examples.
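As a minimal illustration (not taken from the report's test codes), the same three assignments with no subsequent use could legitimately be removed by an optimizer:

      PROGRAM dead_assignments
      REAL, DIMENSION(30,40) :: a1, a2, b1, b2, c1, c2
      a2 = a1        ! the results are never read again, so a
      b2 = b1        ! sufficiently aggressive optimizer may treat
      c2 = c1        ! all three assignments as dead code
      END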
The above is translated (in part) into the following:
      DO a3 = dtx, dtx0, dtx1
        dtx3=dtx3+1
        dtx2=-1
        DO a0 = 1, 30
          dtx2=dtx2+1
          a2(a0, a3)=a1(hi+dtx2*hi0+dtx3*hi1)
          b2(a0, a3)=b1(hi2+dtx2*hi3+dtx3*hi4)
          c2(a0, a3)=c1(hi5+dtx2*hi6+dtx3*hi7)
        ENDDO
      ENDDO

This can be seen in context in the _mpf.f file generated by the translator. All three assignments are handled in one loop nest: a good example of "loop jamming".
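For readers unfamiliar with the term, here is a minimal generic sketch of loop jamming (loop fusion) in terms of the original source arrays; it is an illustration only, not output from any of the translators discussed here:

! Unjammed: one loop nest per assignment (only two of the three shown)
      DO j = 1, 40
        DO i = 1, 30
          a2(i,j) = a1(i,j)
        ENDDO
      ENDDO
      DO j = 1, 40
        DO i = 1, 30
          b2(i,j) = b1(i,j)
        ENDDO
      ENDDO
! Jammed: all assignments share a single loop nest, reducing loop
! overhead and improving locality of reference.
      DO j = 1, 40
        DO i = 1, 30
          a2(i,j) = a1(i,j)
          b2(i,j) = b1(i,j)
          c2(i,j) = c1(i,j)
        ENDDO
      ENDDO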
Note that each node has its own values for dtx, dtx0, and dtx1, computed prior to the loop by APR support routines from the details of the current distribution of each array. (See the discussion of APR's Run Time Model, below.) In fact, the form of this code would not change if the mapping were changed from BLOCK to CYCLIC.
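For example (an illustrative source change only, here switching both dimensions), the directives would become:

!HPF$ DISTRIBUTE a1(CYCLIC, CYCLIC)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2, b1, b2, c1, c2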
Each of the other scalar variables, whether assigned in the loop body or used only in the subscript expressions, serves to access the actually allocated array. APR reports that most native Fortran compilers, given sufficient optimization, can simplify these expressions or recognize their loop invariance, so that execution does not suffer from their apparent complexity; IBM's xlf is reported to have the required capability.
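To see why such subscripts are cheap, consider a hand-simplified sketch of the a2 assignment's inner loop: the transformation shown is ordinary loop-invariant hoisting and strength reduction applied to the generated subscript hi+dtx2*hi0+dtx3*hi1 (the idx variable is introduced here for illustration and is not part of xHPF's output):

! dtx3*hi1 is invariant in the inner loop, and dtx2 advances by 1 per
! iteration, so the whole subscript can be carried as one induction
! variable advanced by hi0 each time around (replacing dtx2 entirely).
      idx = hi + dtx3*hi1
      DO a0 = 1, 30
        a2(a0, a3) = a1(idx)
        idx = idx + hi0
      ENDDO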
Note also that this is the code form xHPF generates for "SHRUNK" array allocation references (each processor contains only its own portion of the mapped array) to each of the right-hand sides (RHS), a1, b1, and c1; the "FULL" array allocation handling of a2, b2, and c2 on the left-hand sides (LHS) has significantly simpler subscripts. xHPF makes this distinction because of the presence of the call use_them(a2, b2, c2) statement after the three assignments: since it was not given the source code of that subroutine to analyze for this conversion, it uses a generalization that is an xHPF extension of the HPF EXTRINSIC(HPF_LOCAL) interface. This extension permits HPF procedures to invoke FORTRAN 77 procedures without an INTERFACE block in the caller. To support the extension (in the absence of other information about use_them), xHPF has allocated the full memory span for each of the three LHS arrays and used simpler subscripting in their accesses. However, each processor still "owns" only part of each array, just as for the explicitly "SHRUNK" RHS arrays.
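For comparison, the standard HPF route would be to give the caller an explicit interface for the local routine; a rough sketch of such an interface block (illustrative only; the report's example deliberately omits it, and the descriptive directives shown merely assert the mapping already established in the caller) might look like:

      INTERFACE
        EXTRINSIC(HPF_LOCAL) SUBROUTINE use_them(a2, b2, c2)
          REAL, DIMENSION(:,:) :: a2, b2, c2
!HPF$     DISTRIBUTE a2 *(BLOCK, BLOCK)
!HPF$     DISTRIBUTE b2 *(BLOCK, BLOCK)
!HPF$     DISTRIBUTE c2 *(BLOCK, BLOCK)
        END SUBROUTINE use_them
      END INTERFACE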
This translation used xHPF's default preference for actually decomposing only one dimension: the rightmost dimension carrying either a BLOCK or CYCLIC keyword, regardless of the rank of the array or the number of its mapped dimensions. This default can be overridden by command-line arguments.
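In effect, for this example the default makes the two-dimensional directive behave as if only the rightmost dimension were decomposed; a rough source-level equivalent of that effective mapping (an illustration, not the translator's internal form) would be:

! Effective mapping under xHPF's default: only the rightmost BLOCK/CYCLIC
! dimension is actually decomposed; the first dimension stays whole.
!HPF$ DISTRIBUTE a1(*, BLOCK)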
The following code precedes the loop in the generated _mpf.f file:
      CALL dd_dstloop(10, 1, 40, 1, dtx, dtx0, dtx1, a1, 162, -11, 1, 1
     . , 30, 3, 1, 1, 10)
      dtx3=-1
      CALL dl_mem_by_dl(a1, 162, 5, hi, -11, 1, 1, 30, hi0, 3, 1, 1, 10
     . , hi1)
      CALL dl_mem_by_dl(b1, 198, 5, hi2, -11, 1, 1, 30, hi3, 3, 1, 1, 10
     . , hi4)
      CALL dl_mem_by_dl(c1, 180, 5, hi5, -11, 1, 1, 30, hi6, 3, 1, 1, 10
     . , hi7)
      CALL dd_preloop_xchng(11, 25, 'three_assignments_2.f.F77', dtx,
     . dtx0, dtx1)
      CALL dl_modify(hi7, hi6, hi5, hi4, hi3, hi2, hi1, hi0, hi)

This code anticipates APR's dynamic memory mapping model. The xHPF run time system accommodates more dynamic redistribution than the static HPF Subset directives would imply: the extra facilities would be available to users of the APR-specific !APR$ PARTITION... directive.
The dd_dstloop(...) call determines the "control distribution" (APR calls this "spreading" in its manuals) of the parallelized "DO a3" loop (the rightmost dimension of each array), based on an owner-computes analysis of the a1 array (any of the three "SHRUNK" RHS arrays would suffice; if LHS arrays were "SHRUNK", one of them would be selected for the control details). xHPF is notable for its selection of "owner computes" control determination from either RHS or LHS array references, sometimes on a statement-by-statement basis within a converted parallel control structure (this will be seen more explicitly in Chapter 6). Each of the dl_mem_by_dl(...) calls determines the current mapping for its subject array, filling in values that will be used in the loop. Note that a determination is made in this routine about required inter-processor communication for any array elements, but no actual communication is initiated yet. Then dd_preloop_xchng(...) compares the current node "locality" recorded earlier in the "control distribution" with values saved within the runtime system by each dl_mem_by_dl call (a "work list"), and actually performs any needed inter-processor communication. Finally, dl_modify(...) updates the scalar variables that might be affected by such a communication.
For this trivial code with the directives as given, of course there is no required communication.
After the loop code there is:
      CALL dd_postloop_xchng(18, 25, 'three_assignments_2.f.F77')

This anticipates that some later array-using code might require a different mapping of the data, and performs any anticipatory inter-processor communication. Again, the saved "work list" is used to make the determination, and again, for this code, no communication would be needed except at the interface to the use_them invocation: if xHPF had been provided with the source for use_them, and its mappings of the three arrays corresponded with the caller's, no actual communication would be executed. Our experience with xHPF has shown that its overhead, even in these "null communication" cases, is fairly small. One can get an actual measurement of that from the instrumented library that is part of the xHPF product.
With "O3" optimization, pghpf translates the three assignments into the following:
! forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a2((u$$b2-l$$b2+1)*(
! +i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1((u$$b-l$$b+1)*(i$i-l$$b1)+i$i1-
! +l$$b+a1$p)
      do i$i = i$$l1, i$$u1
         do i$i1 = i$$l, i$$u
            a2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1((u$$b-
     +        l$$b+1)*(i$i-l$$b1)+i$i1-l$$b+a1$p)
            b2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1((u$$b4-
     +        l$$b4+1)*(i$i-l$$b5)+i$i1-l$$b4+b1$p)
            c2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+c2$p) = c1((u$$b4-
     +        l$$b4+1)*(i$i-l$$b5)+i$i1-l$$b4+c1$p)
         enddo
      enddo

This is seen in context in the Fortran file that can be saved during a pghpf translation. Loop jamming has been implemented in this case. (With no optimization in the compiler invocation, each assignment would be translated into its own loop.)
The commented-out "forall" line is an abbreviated indication of the pghpf compiler's parallelization accomplishments. (Only the first assignment of the three is indicated in the comment; it is not intended to be a correct HPF FORALL statement.) Here both dimensions of the distribution are implemented. Also, since this compiler determines that there can be no inter-processor communication, it generates no extra calls. PGI's pghpf has no facility corresponding to xHPF's "FULL" allocations, so all distributed arrays are locally allocated in per-node-portion form in the generated code.
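For reference, a correct FORALL for the first of the three assignments, written in terms of the original program's arrays rather than pghpf's internal names, would simply be:

      INTEGER :: i, j
      FORALL (i = 1:30, j = 1:40) a2(i, j) = a1(i, j)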
XL HPF translates the three assignment statements as:
C 1585-501 Original Source Line 6
      do i_5=iown_l_15,MIN0(iown_u_16,40),1
C 1585-501 Original Source Line 6
        do i_6=iown_l_17,MIN0(iown_u_18,30),1
          a2_28(i_6,i_5) = a1_27(i_6,i_5)
          b2_30(i_6,i_5) = b1_29(i_6,i_5)
          c2_32(i_6,i_5) = c1_31(i_6,i_5)
        end do
      end do

The above is extracted from the pseudo-Fortran listing. Clearly there is good loop jamming here.
This is the form of the code under the compiler's "-qhot" (High Order loop Transformations) optimization, and it is the portion of the listing elicited by the "-qreport=hotlist" command-line option. (Examination of earlier portions of that listing shows all the invocation options and the more line-by-line HPF Parallelization Report elicited by "hpflist".)
This compiler generated a two-dimensional parallel spreading of the assignments. The arrays are allocated locally with each processor holding only its portion. No communication is needed and so none is generated.
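The iown_l/iown_u values in the listing are simply per-processor BLOCK bounds; a small self-contained sketch of the usual formula (illustrative values and names, not IBM's runtime code) is:

      PROGRAM block_bounds
        INTEGER :: n, np, myp, blk, lb, ub
        n   = 40                     ! extent of the distributed dimension
        np  = 4                      ! illustrative number of processors
        myp = 1                      ! illustrative processor index, 0-based
        blk = (n + np - 1) / np      ! block size, rounded up
        lb  = myp*blk + 1            ! first owned element
        ub  = MIN(n, (myp + 1)*blk)  ! last owned element; cf. MIN0(iown_u_16,40)
        PRINT *, 'processor', myp, 'owns elements', lb, 'to', ub
      END PROGRAM block_bounds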