The following trivial example is examined:
      PROGRAM sum_prog
      REAL, DIMENSION(30,40) :: a1, a2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2
      s = sum(a1+a2)
      call use_it(s)
      END

There is a two-dimensional elemental summation inside the invocation of the array reduction. Again, an extra subroutine invocation prevents the final scalar value from being assumed to be "unused" at high levels of compiler optimization.
For xHPF, the assignment statement generates the following code, as seen in the generated _mpf.f file:

      s = 0.0
      CALL dd_dstloop(10, 1, 40, 1, dtx, dtx0, dtx1, a1, 108, -11, 1, 1
     . , 30, 3, 1, 1, 10)
      dtx3=-1
      CALL dd_def_red(s, dd_tdesc(s, 162, 12), 2, 1)
      CALL dl_mem_by_dl(a1, 108, 5, hi, -11, 1, 1, 30, hi0, 3, 1, 1, 10
     . , hi1)
      CALL dl_mem_by_dl(a2, 126, 5, hi2, -11, 1, 1, 30, hi3, 3, 1, 1, 10
     . , hi4)
      CALL dd_preloop_xchng(11, 16, 'sum_prog_2.f.F77', dtx, dtx0, dtx1
     . , s)
      CALL dl_modify(hi4, hi3, hi2, hi1, hi0, hi)
      DO a3 = dtx, dtx0, dtx1
        dtx3=dtx3+1
        dtx2=-1
        DO a0 = 1, 30
          dtx2=dtx2+1
          s=s+a1(hi+dtx2*hi0+dtx3*hi1)+a2(hi2+dtx2*hi3+dtx3*hi4)
        ENDDO
      ENDDO
      CALL dd_postloop_xchng(16, 16, 'sum_prog_2.f.F77', s)
The lines that are significant to the elemental summation and to the reduction are highlighted. The "work list" for communication is indicated by the dd_def_red(...) call. Each node generates its local elemental summation of its portion of a1 and a2, and its local partial scalar summation into s, in the inner "DO a0" loop. Finally, the inter-processor communication needed to perform the array reduction (and to convey the scalar result to all the participating processors) is performed by the dd_postloop_xchng(...) call.
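The same two-phase pattern (a local partial sum over each node's section, then one global combine that leaves the result everywhere) can be written directly against MPI. The free-standing sketch below is not xHPF's run-time library; the simple one-dimensional column blocking and the use of MPI_ALLREDUCE in place of dd_postloop_xchng are assumptions made only for illustration.

! Illustrative sketch only: plain Fortran 90 + MPI standing in for the
! xHPF run-time calls shown above.
program partial_sum_sketch
  use mpi
  implicit none
  real, dimension(30,40) :: a1, a2
  real    :: s_local, s
  integer :: me, np, nb, j, jlo, jhi, ierr

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

  a1 = 1.0                       ! stand-in data
  a2 = 2.0

  ! a simple BLOCK distribution of the 40 columns (1-D for brevity;
  ! the example in the text distributes both dimensions)
  nb  = (40 + np - 1) / np
  jlo = me*nb + 1
  jhi = min(40, (me + 1)*nb)

  ! per-node partial sum, the role played by the "DO a0" loop above
  s_local = 0.0
  do j = jlo, jhi
    s_local = s_local + sum(a1(:,j) + a2(:,j))
  end do

  ! one collective plays the role of dd_postloop_xchng: combine the
  ! partial sums and leave the result on every participating processor
  call MPI_ALLREDUCE(s_local, s, 1, MPI_REAL, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)

  if (me == 0) print *, 's =', s
  call MPI_FINALIZE(ierr)
end program partial_sum_sketch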
At CTC we have examined the communication pattern involved: it is a tree of log P levels over the P participating processors. The MPI communications are accomplished with blocking sends and receives.
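A tree combine of that shape can be sketched with blocking point-to-point MPI as follows. This is only an illustration of the pattern described above, not xHPF's actual code; it assumes the number of processors is a power of two and, for brevity, redistributes the final value with MPI_BCAST.

! Illustrative sketch: a log2(P)-level binomial-tree sum over blocking
! sends and receives, followed by a broadcast of the result.
subroutine tree_sum(s_local, s, comm)
  use mpi
  implicit none
  real,    intent(in)  :: s_local
  real,    intent(out) :: s
  integer, intent(in)  :: comm
  integer :: me, np, step, ierr
  integer :: status(MPI_STATUS_SIZE)
  real    :: partner_val

  call MPI_COMM_RANK(comm, me, ierr)
  call MPI_COMM_SIZE(comm, np, ierr)

  s = s_local
  step = 1
  do while (step < np)                  ! log2(np) levels, np a power of two
    if (mod(me, 2*step) == 0) then      ! receive a partner's partial sum
      call MPI_RECV(partner_val, 1, MPI_REAL, me + step, 0, comm, &
                    status, ierr)
      s = s + partner_val
    else                                ! send the partial sum and drop out
      call MPI_SEND(s, 1, MPI_REAL, me - step, 0, comm, ierr)
      exit
    end if
    step = 2*step
  end do

  ! convey the scalar result to all participating processors
  call MPI_BCAST(s, 1, MPI_REAL, 0, comm, ierr)
end subroutine tree_sum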
PGI pghpf instantiates a temporary array, a1$a, to hold the elemental summation, and then invokes its own run-time system summation-reduction routine:

      call pghpf_localize_bounds(a1$a$d1(a1$a$dp1),1,1,30,1,i$$l,i$$u)
      call pghpf_localize_bounds(a1$a$d1(a1$a$dp1),2,1,40,1,i$$l1,i$$u1)
! forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a1$a((u$$b-l$$b+1)*(
! +i$i-l$$b1)+i$i1-l$$b+a1$a$p) = a1((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1
! +-l$$b2+a1$p) + a2((u$$b4-l$$b4+1)*(i$i-l$$b5)+i$i1-l$$b4+a2$p)
      do i$i = i$$l1, i$$u1
      do i$i1 = i$$l, i$$u
      a1$a((u$$b-l$$b+1)*(i$i-l$$b1)+i$i1-l$$b+a1$a$p) = a1((u$$b2
     +-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a1$p) + a2((u$$b4-l$$b4+1)*(i$i-
     +l$$b5)+i$i1-l$$b4+a2$p)
      enddo
      enddo
      call pghpf_sums(a1$a$r,a1$a(a1$a$p),.true.,27,a1$a$d1(a1$a$dp1),19
     +)
      s = a1$a$r

This can be seen in the context of the full saved Fortran file.
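In outline, the pghpf strategy amounts to the ordinary Fortran + MPI sketch below: form a temporary array holding the elemental sum of the local sections, then hand it to a single reduction step. The argument shapes and the use of MPI_ALLREDUCE in place of the pghpf_sums run-time routine are assumptions made for illustration.

! Illustrative sketch of the pghpf approach; not the pghpf run time.
subroutine pghpf_style_sum(a1_local, a2_local, s, comm)
  use mpi
  implicit none
  real,    intent(in)  :: a1_local(:,:), a2_local(:,:)  ! this node's sections
  real,    intent(out) :: s
  integer, intent(in)  :: comm
  real, allocatable :: tmp(:,:)
  real    :: sum_local
  integer :: ierr

  ! the temporary plays the role of a1$a in the listing above
  allocate(tmp(size(a1_local,1), size(a1_local,2)))
  tmp = a1_local + a2_local
  sum_local = sum(tmp)          ! local contribution to the reduction
  deallocate(tmp)

  ! a generic MPI collective stands in for pghpf_sums
  call MPI_ALLREDUCE(sum_local, s, 1, MPI_REAL, MPI_SUM, comm, ierr)
end subroutine pghpf_style_sum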
The IBM XL HPF strategy for this case is similar to that of xHPF in that it uses local scalar variables (here the compiler-generated ScRed_40, ScRed_41, ScRed_42, ScRed_43, and SCALAR_29) for the per-node partial sums. Note how the elemental-summation loop has been "unrolled" four times at this high level of compiler optimization. XL HPF also resembles xHPF in that it accomplishes the elemental array summation in the same do i_10 / do i_10 loop pair that computes the local sum. It then invokes its own run-time system routine, _xlhpf_reduce_sum(...), to complete the reduction from each node's local scalar sum:
      s = 0.
      SCALAR_29 = 0.
      ScRed_40 = dble(SCALAR_29)
      ScRed_41 = dble(0)
      ScRed_42 = dble(0)
      ScRed_43 = dble(0)
C 1585-501 Original Source Line 6
      do i_9=iown_l_19,MIN0(iown_u_20,40),1
C 1585-501 Original Source Line 6
        do i_10=iown_l_21,MIN0(iown_u_22,30) - 3,4
          ScRed_40 = ScRed_40 + dble(a1_35(i_10,i_9) + a2_36(i_10,i_9))
          ScRed_41 = ScRed_41 + dble(a1_35(i_10 + 1,i_9) + a2_36(i_10 +
     & 1,i_9))
          ScRed_42 = ScRed_42 + dble(a1_35(i_10 + 2,i_9) + a2_36(i_10 +
     & 2,i_9))
          ScRed_43 = ScRed_43 + dble(a1_35(i_10 + 3,i_9) + a2_36(i_10 +
     & 3,i_9))
        end do
C 1585-501 Original Source Line 6
        do i_10=i_10,MIN0(iown_u_22,30),1
          ScRed_40 = ScRed_40 + dble(a1_35(i_10,i_9) + a2_36(i_10,i_9))
        end do
      end do
      SCALAR_29 = real(ScRed_40 + ScRed_41 + ScRed_42 + ScRed_43)
      Recv_index_32(1) = (-2)
      Recv_index_32(2) = (-2)
      Send_index_33(1) = 0
      Send_index_33(2) = 1
      DS_SAS_34(1) = 0
      DS_SAS_34(2) = MIN0(29 / D_18(1),PGB_14(1) - 1)
      DS_SAS_34(3) = 1
      DS_SAS_34(4) = 0
      DS_SAS_34(5) = MIN0(39 / D_18(3),PGB_14(2) - 1)
      DS_SAS_34(6) = 1
      call _xlhpf_reduce_sum(9,SCALAR_29,s,PG_16,2,Send_index_33,DS_SAS
     &_34,Recv_index_32)

Complete details can be seen in the pseudo-Fortran listing.
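Stripped of the compiler-generated bookkeeping, the local part of this strategy is the pattern sketched below: four independent double-precision accumulators, a stride-4 unrolled main loop, a remainder loop, and a final combine into one scalar that is later reduced across nodes. The bounds nlo/nhi and mlo/mhi stand in for the iown_* ownership bounds and are assumptions for illustration.

! Illustrative sketch of the XL HPF local accumulation; the global
! combine (_xlhpf_reduce_sum in the listing) is not shown here.
subroutine unrolled_partial_sum(a1, a2, nlo, nhi, mlo, mhi, partial)
  implicit none
  real,    intent(in)  :: a1(:,:), a2(:,:)
  integer, intent(in)  :: nlo, nhi, mlo, mhi
  real,    intent(out) :: partial
  double precision :: acc0, acc1, acc2, acc3
  integer :: i, j

  acc0 = 0.d0; acc1 = 0.d0; acc2 = 0.d0; acc3 = 0.d0
  do j = mlo, mhi
    do i = nlo, nhi - 3, 4                   ! main loop, unrolled by four
      acc0 = acc0 + dble(a1(i,   j) + a2(i,   j))
      acc1 = acc1 + dble(a1(i+1, j) + a2(i+1, j))
      acc2 = acc2 + dble(a1(i+2, j) + a2(i+2, j))
      acc3 = acc3 + dble(a1(i+3, j) + a2(i+3, j))
    end do
    do i = i, nhi                            ! remainder iterations, if any
      acc0 = acc0 + dble(a1(i, j) + a2(i, j))
    end do
  end do
  ! collapse the four accumulators into the node-local scalar sum
  partial = real(acc0 + acc1 + acc2 + acc3)
end subroutine unrolled_partial_sum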
At CTC we have observed that the reduction communication is accomplished with an MPI collective REDUCE_ALL operation followed by an MPI broadcast of the scalar value.
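The observed pattern corresponds to the two MPI calls sketched below. MPI itself has no routine literally named REDUCE_ALL; the sketch reads the observation as an MPI_REDUCE to a root followed by an MPI_BCAST, which is an assumption (MPI_ALLREDUCE would fuse the two steps into one call).

! Illustrative sketch of a reduce-then-broadcast of the scalar result.
subroutine reduce_then_bcast(s_local, s, comm)
  use mpi
  implicit none
  real,    intent(in)  :: s_local
  real,    intent(out) :: s
  integer, intent(in)  :: comm
  integer :: ierr

  ! combine every node's partial sum onto rank 0 ...
  call MPI_REDUCE(s_local, s, 1, MPI_REAL, MPI_SUM, 0, comm, ierr)
  ! ... then broadcast the combined scalar back to all nodes
  call MPI_BCAST(s, 1, MPI_REAL, 0, comm, ierr)
end subroutine reduce_then_bcast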