NHSE Review™ 1996 Volume, Second Issue

Comparison of 3 HPF Compilers



Chapter 6 -- Communication

Most HPF applications, even with maximal data parallel operations, require coordinated inter-processor transmission of array values that are not perfectly aligned. The following small example forces each compiler to communicate array elements, using a regular pattern, to complete each of the two array assignment statements:
      PROGRAM two_triples

      REAL, DIMENSION(30,40) :: a1, a2, b1, b2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2, b1, b2

      CALL init( a1, b1)

      a2(1:29, 1:39) = a1(2:30, 2:40)
      b2(2:30, 2:40) = b1(1:29, 1:39)

      CALL use( a2, b2)

      END

For any processor arrangement with more than one processor, elements of the a1 and b1 arrays at the "edges" of the local blocks will need to be communicated to an "adjacent" processor. The two subroutine calls (to init and use) have been inserted so that even an extremely clever optimizer could not deduce that a1 and b1 were never initialized, or that a2 and b2 were never used after the assignments, and thereby eliminate the assignments altogether.
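
The bodies of init and use are not given in the article; a minimal, hypothetical pair of stubs, written here only so the example is complete (the mapping directives simply repeat the caller's), could be:

      SUBROUTINE init(a, b)
      REAL, DIMENSION(30,40) :: a, b
!HPF$ DISTRIBUTE a(BLOCK, BLOCK)
!HPF$ ALIGN b(:,:) WITH a(:,:)
!     Give a1 and b1 defined values so the assignments cannot be
!     removed as reads of undefined data.
      a = 1.0
      b = 2.0
      END

      SUBROUTINE use(a, b)
      REAL, DIMENSION(30,40) :: a, b
!HPF$ DISTRIBUTE a(BLOCK, BLOCK)
!HPF$ ALIGN b(:,:) WITH a(:,:)
!     Consume a2 and b2 so the assignments cannot be removed as
!     stores to unused data.
      PRINT *, SUM(a), SUM(b)
      END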


6.1 APR xHPF

Before examining the parallel-code translation produced by xHPF, it is useful to look at what the translator generates internally while analyzing the Fortran 90 source. xHPF makes an intermediate translation from Fortran 90 to semantically equivalent (serial) FORTRAN 77 source code before performing its final parallelism analysis and translation for parallel execution. (This stems partly from the product's orientation as a translator of FORTRAN 77-based HPF codes and its automatic-parallelization focus on that language.) How that translation is performed can ultimately affect how the communication code is organized. In the following command-line invocation of xHPF:

  xhpf -keep -verbose=20 -ompf=two_triples_f90_mpf.f \
               -BenignUnknown two_triples.f

the -keep option causes the intermediate FORTRAN 77 file to be preserved:

      PROGRAM two_triples

      REAL a1(30, 40), a2(30, 40), b1(30, 40), b2(30, 40)
CHPF$ DISTRIBUTE a1(BLOCK, BLOCK)
CHPF$ ALIGN (:, :) WITH a1(:, :) :: a2, b1, b2
      INTEGER apr_opt_int_arg
      CHARACTER apr_opt_char_arg
      COMMON /apr_res1/apr_opt_int_arg/apr_res2/apr_opt_char_arg
      INTEGER a, a0, a3, a4, a5

      CALL init(a1, b1)

      DO a3 = 1, 39
        DO a0 = 1, 29
          a2(a0, a3) = a1(1 + a0, 1 + a3)
          b2(1 + a0, 1 + a3) = b1(a0, a3)
        ENDDO
      ENDDO

      CALL use(a2, b2)

      END

Notice that, apart from the extra declarations anticipating aspects of the xHPF runtime system, the translation from Fortran 90 to FORTRAN 77 is optimal: the translator has recognized that the subscript triples in both statements are mutually conformable, and that there is no "overlap" in elements between right-hand sides and left-hand sides, nor between any fetches and stores across the two assignments (there are no true, loop-carried, or anti-dependences).
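
For contrast, consider a hypothetical variant (not part of the article's example) in which the two sides do overlap. Fortran 90 array-assignment semantics require the right-hand side to be evaluated as if all fetches occur before any store, so a naive forward scalarization would re-read elements that have already been overwritten; a correct rendering needs reversed loops or a temporary:

      PROGRAM overlap_variant
!     Hypothetical overlapping assignment: a1(2:30,2:40) = a1(1:29,1:39)
      REAL, DIMENSION(30,40) :: a1
      INTEGER a0, a3
      CALL RANDOM_NUMBER(a1)
!     Running the loops backwards reads each a1(a0,a3) before the
!     iteration that would overwrite it; a forward loop would not.
      DO a3 = 39, 1, -1
        DO a0 = 29, 1, -1
          a1(1 + a0, 1 + a3) = a1(a0, a3)
        ENDDO
      ENDDO
      PRINT *, a1(1,1), a1(30,40)
      END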

The communication code generated for these two assignments is:

      CALL dd_dstloop(10, 1, 39, 1, dtx, dtx0, dtx1, a2, 162, -11, 1, 1
     .    , 29, 3, 1, 1, 10)
      CALL dd_use_by_dl(a1, 126, -11, 2, 1, 29, 3, 2, 1, 10)
      CALL dd_set_by_dl(b2, 144, -11, 2, 1, 29, 3, 2, 1, 10)
      CALL dd_preloop_xchng(13, 17, 'two_triples.f.F77', dtx, dtx0, dtx1
     .    )
      DO a3 = dtx, dtx0, dtx1
        DO a0 = 1, 29
          a2(a0, a3) = a1(1 + a0, 1 + a3)
          b2(1 + a0, 1 + a3) = b1(a0, a3)
        ENDDO
      ENDDO
      CALL dd_postloop_xchng(19, 17, 'two_triples.f.F77')

This excerpt can be seen in full context in the entire generated code. The parallelism has defaulted to a single dimension, the rightmost of each array, as in our other examples using xHPF.

Examination of the communication pattern of the above (e.g., with a software communications-trace analyzer on the SP) shows that the initial and final communication patterns, the ...preloop... and ...postloop... exchanges, are strongly influenced by the adjacent calls to the init and use subroutines. xHPF has no interface information for those two subroutines, so it must make conservative assumptions about which array elements they define or use, which adds communication around the calls.

Despite any excess communication before or after the two assignments, one can see from the generated code that there will be no extra communication within either the outer "do a3" loop or the inner "do a0" loop. This exhibits xHPF's extended implementation of "owner computes", which is not necessarily tied to the array reference on an assignment's left-hand side. Also, the APR runtime system is prepared to move only those values of a1 (before the loop nest) or b2 (after the loop nest) that are not local to each processor's elements of a2 (equivalently, b1) in the subscript range (1:29, 1:39).
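
The dd_* entries are internal to APR's runtime library, so the exchange itself cannot be shown directly; since xHPF parallelized only the rightmost dimension here, the values that must move are, in effect, block-edge columns shifted between neighboring processors. As a rough, hypothetical illustration of that kind of exchange in plain Fortran and MPI (the routine, the one-column overlap region, and the neighbor ranks are assumptions for illustration, not APR's actual scheme):

      SUBROUTINE edge_exchange(local, ncols, left, right, comm)
!     Hypothetical boundary exchange: each processor owns columns
!     1..ncols of its block plus one extra overlap column at ncols+1,
!     filled here from the right neighbor before the loop nest runs.
!     Boundary processors pass MPI_PROC_NULL for left or right.
      INCLUDE 'mpif.h'
      INTEGER ncols, left, right, comm, ierr
      INTEGER status(MPI_STATUS_SIZE)
      REAL local(30, ncols + 1)

!     Send my first owned column to the left neighbor; receive the
!     right neighbor's first column into my overlap column.
      CALL MPI_SENDRECV(local(1, 1), 30, MPI_REAL, left, 0,
     .     local(1, ncols + 1), 30, MPI_REAL, right, 0,
     .     comm, status, ierr)
      END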


6.2 PGI pghpf

The pghpf translation does not take advantage of the conformability of the four array sections, or of the lack of dependences between them, when rendering the assignments to per-processor code, even when the translator is invoked at the -O3 optimization level. It generates separate communication and loop control for each assignment statement:

      cp$$com = pghpf_olap_shift(a1(a1$p),a1$d2(a1$dp2),1,1,1,1)
      xfer$$com = pghpf_comm_start(cp$$com,a1(a1$p),a1$d2(a1$dp2),a1(
     +a1$p),a1$d2(a1$dp2))
      call pghpf_comm_finish(xfer$$com)
      call pghpf_localize_bounds(b2$d1(b2$dp1),1,1,29,1,i$$l,i$$u)
      call pghpf_localize_bounds(b2$d1(b2$dp1),2,1,39,1,i$$l1,i$$u1)
!     forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a2((u$$b2-l$$b2+1)*(
!    +i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1(i$i1+1+(u$$b-l$$b+1)*(i$i+1-l$$b1
!    +)-l$$b+a1$p)
      do i$i = i$$l1, i$$u1
         do i$i1 = i$$l, i$$u
            a2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a2$p) = a1(i$i1+1+
     +(u$$b-l$$b+1)*(i$i+1-l$$b1)-l$$b+a1$p)
         enddo
      enddo
      xfer$$com1 = pghpf_comm_start(cp$$com,b1(b1$p),b2$d1(b2$dp1),b1(
     +b1$p),b2$d1(b2$dp1))
      call pghpf_comm_finish(xfer$$com1)
      call pghpf_localize_bounds(b2$d1(b2$dp1),1,2,30,1,i$$l2,i$$u2)
      call pghpf_localize_bounds(b2$d1(b2$dp1),2,2,40,1,i$$l3,i$$u3)
!     forall (i$i=i$$l3:i$$u3:1, i$i1=i$$l2:i$$u2:1) b2((u$$b2-l$$b2+1)*
!    +(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1(i$i1-1+(u$$b2-l$$b2+1)*(i$i-1-
!    +l$$b3)-l$$b2+b1$p)
      do i$i = i$$l3, i$$u3
         do i$i1 = i$$l2, i$$u2
            b2((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+b2$p) = b1(i$i1-1+
     +(u$$b2-l$$b2+1)*(i$i-1-l$$b3)-l$$b2+b1$p)
         enddo
      enddo
      call pghpf_comm_free(1,cp$$com)

Apart from some initial allocation of the per-processor dynamically-determined portion of each array, there isn't much more executable code in the complete generated source.

Parallel execution is indicated by comments containing FORALL, and is along both array dimensions. For each assignment statement, all communication is accomplished prior to the per-processor double-loop, with the invocations of pghpf_olap_shift(...), pghpf_comm_start(...), and pghpf_comm_finish(...), using memory space managed with bookkeeping in PGI's runtime system via the INTEGER variable cp$$com. Note the freeing-up of that space after both assignments at the call to subroutine pghpf_comm_free(...).

Note also that pghpf assumes that any user routine for which it has not been supplied an INTERFACE block is an HPF routine. In the generated source, the interface to each of init and use has been expanded from two to four arguments: the extra arguments give the PGI procedure-invocation runtime system access to details of the mapping of each original array argument. Any required remappings would be performed in each subroutine. This code allocates only that part of the mapped array that is local to each processor.
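
One way for the programmer to avoid that default is to supply the interface information explicitly. A sketch (not shown in the article) of an explicit interface for init, placed in the specification part of two_triples and carrying the same mapping directives as the caller, might look like:

      INTERFACE
        SUBROUTINE init(a, b)
          REAL, DIMENSION(30,40) :: a, b
!HPF$     DISTRIBUTE a(BLOCK, BLOCK)
!HPF$     ALIGN b(:,:) WITH a(:,:)
        END SUBROUTINE init
      END INTERFACE

With such an interface visible in the caller, the compiler at least knows that the dummy mappings match the actual arguments, so no remapping should be needed at the call site.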


6.3 IBM XL HPF

IBM XL HPF, although it has a well-integrated optimizer and has elsewhere been seen by CTC to generate impressively clean code, handles the communication for these two assignments no better than pghpf: each assignment is translated into its own separate communication actions followed by its own nested do-loop pair. Since the generated-code pseudo-Fortran listing is at such a low semantic level (although the MPI calls cannot be seen directly), a full extract of that listing for those two statements spans 128 lines. The full listing (approximately 20K bytes) shows that region in its surrounding context. The listing contains both the earlier, pre-"High Order Transform", annotated pseudo-source "HPF Parallelization Report" and the later "Loop Transformation Report"; in fact, except for one extra declaration in the later portion, those listing portions are identical.

Each statement's essential computation is preceded by two sequences with general structure:

C 1585-501  Original Source Line ...
       if ( ...<processor to receive values>... ) then
         ...<initialize communication-control data structures>...
         call _xlhpf_nbreceive_section(<right-hand-side-array>, ... )
       end if
C 1585-501  Original Source Line ...
       if ( ...<processor to send values>... ) then
         ...<initialize communication-control data structures>...
         call _xlhpf_send_section(<right-hand-side-array>, ... )
       end if
       call _xlhpf_waitforall(1)

Each one of the two receive/send/wait sequences deals with communication along one of the two mapped dimensions of the array.
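
The _xlhpf_* entry points are internal to the XL HPF runtime, but the overall shape, post a nonblocking receive first, then issue the matching send, then wait, is the standard way to pair the two sides of a transfer without deadlock. A generic, hypothetical MPI rendering of one such sequence (not IBM's actual code):

      SUBROUTINE shift_section(sendbuf, recvbuf, n, dest, src, comm)
!     Generic sketch of the receive/send/wait pattern above: the
!     receive is posted (nonblocking) before the matching send, and a
!     final wait completes the transfer on this processor.
      INCLUDE 'mpif.h'
      INTEGER n, dest, src, comm, ierr
      INTEGER req(1), stats(MPI_STATUS_SIZE, 1)
      REAL sendbuf(n), recvbuf(n)

      CALL MPI_IRECV(recvbuf, n, MPI_REAL, src, 0, comm, req(1), ierr)
      CALL MPI_SEND(sendbuf, n, MPI_REAL, dest, 0, comm, ierr)
      CALL MPI_WAITALL(1, req, stats, ierr)
      END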

Finally, the on-processor essential computations (assignments, here) are performed by do-loop nests. The assignment a2(1:29, 1:39) = a1(2:30, 2:40) generates:

C 1585-501  Original Source Line 9
       do i_5=iown_l_15,MIN0(iown_u_16,39),1
C 1585-501  Original Source Line 9
         do i_6=iown_l_17,MIN0(iown_u_18,29),1
           a2_46(i_6,i_5) = a1_45(i_6 + 1,i_5 + 1)
         end do
       end do

and the assignment b2(2:30, 2:40) = b1(1:29, 1:39) generates:

C 1585-501  Original Source Line 10
       do i_5=MAX0(iown_l_15 - 1,1),MIN0(iown_u_16 - 1,39),1
C 1585-501  Original Source Line 10
         do i_6=MAX0(iown_l_17 - 1,1),MIN0(iown_u_18 - 1,29),1
           b2_48(i_6 + 1,i_5 + 1) = b1_47(i_6,i_5)
         end do
       end do

Loop bounds, such as iown_l_15, are initialized early in the pseudo-Fortran, for example:

       iown_l_15 = 1 + ((40 + PGB_10(2)) - 1) / PGB_10(2) * PID_11(2)

close to the allocation of the mapped array portions for each processor, and prior to the communication code shown above.
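
Reading that expression (an interpretation; the listing does not name the variables): if PGB_10(2) holds the number of processors spanning the second array dimension and PID_11(2) this processor's zero-based coordinate along it, then the expression is the usual BLOCK lower bound, 1 + ceiling(extent/nprocs) * coordinate. For example, with 4 processors spanning the extent-40 dimension:

      PROGRAM block_bounds
!     Hypothetical worked example of the BLOCK lower-bound formula,
!     assuming 4 processors span the extent-40 array dimension.
      INTEGER extent, nprocs, coord, blk, iown_l
      extent = 40
      nprocs = 4
      blk = (extent + nprocs - 1) / nprocs     ! ceiling(40/4) = 10
      DO coord = 0, nprocs - 1
        iown_l = 1 + blk * coord               ! yields 1, 11, 21, 31
        PRINT *, 'coordinate', coord, ' lower bound', iown_l
      ENDDO
      END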

Given the conditional subroutine calls for communication, and the different expressions used for each loop-nest's bounds, it is highly unlikely that any later optimization phase of the XL HPF compiler could combine the two array assignments in the way that the xHPF compiler has.

The init and use subroutine calls generate no extra code in the calling program and appear to take arguments that are pointers to run time descriptor structures for each of their array arguments (see the POINTER declaration and commentary in the pseudo-Fortran for each of a1, a2, b1, and b2).

Copyright © 1996



