The following trivial example is examined:
      PROGRAM sum_prog
      REAL, DIMENSION(30,40) :: a1, a2
!HPF$ DISTRIBUTE a1(BLOCK, BLOCK)
!HPF$ ALIGN (:,:) WITH a1(:,:) :: a2
      s = sum(a1+a2)
      call use_it(s)
      END

There is a two-dimensional elemental summation inside the invocation of the array reduction. Again, an extra subroutine invocation prevents the final scalar value from being assumed to be "unused" at high levels of compiler optimization.
For xHPF, the assignment statement generates the following code, as seen in the generated _mpf.f file:

      s = 0.0
      CALL dd_dstloop(10, 1, 40, 1, dtx, dtx0, dtx1, a1, 108, -11, 1, 1
     . , 30, 3, 1, 1, 10)
      dtx3=-1
      CALL dd_def_red(s, dd_tdesc(s, 162, 12), 2, 1)
      CALL dl_mem_by_dl(a1, 108, 5, hi, -11, 1, 1, 30, hi0, 3, 1, 1, 10
     . , hi1)
      CALL dl_mem_by_dl(a2, 126, 5, hi2, -11, 1, 1, 30, hi3, 3, 1, 1, 10
     . , hi4)
      CALL dd_preloop_xchng(11, 16, 'sum_prog_2.f.F77', dtx, dtx0, dtx1
     . , s)
      CALL dl_modify(hi4, hi3, hi2, hi1, hi0, hi)
      DO a3 = dtx, dtx0, dtx1
        dtx3=dtx3+1
        dtx2=-1
        DO a0 = 1, 30
          dtx2=dtx2+1
          s=s+a1(hi+dtx2*hi0+dtx3*hi1)+a2(hi2+dtx2*hi3+dtx3*hi4)
        ENDDO
      ENDDO
      CALL dd_postloop_xchng(16, 16, 'sum_prog_2.f.F77', s)
The lines that are significant to the elemental summation and to the reduction are highlighted. The "work list" for communication is indicated by the dd_def_red(...) call. Each node generates its local elemental summation of its portion of a1 and a2, and its local partial scalar summation into s, in the inner "DO a0" loop. Finally, the inter-processor communication needed to perform the array reduction (and to convey the scalar result to all the participating processors) is performed by the dd_postloop_xchng(...) call.
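The same two-phase pattern (a local partial sum over each node's section, then one global combine that leaves the result everywhere) can be written directly against MPI. The free-standing sketch below is not xHPF's run-time library; the simple one-dimensional column blocking and the use of MPI_ALLREDUCE in place of dd_postloop_xchng are assumptions made only for illustration.

! Illustrative sketch only: plain Fortran 90 + MPI standing in for the
! xHPF run-time calls shown above.
program partial_sum_sketch
  use mpi
  implicit none
  real, dimension(30,40) :: a1, a2
  real    :: s_local, s
  integer :: me, np, nb, j, jlo, jhi, ierr

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

  a1 = 1.0                       ! stand-in data
  a2 = 2.0

  ! a simple BLOCK distribution of the 40 columns (1-D for brevity;
  ! the example in the text distributes both dimensions)
  nb  = (40 + np - 1) / np
  jlo = me*nb + 1
  jhi = min(40, (me + 1)*nb)

  ! per-node partial sum, the role played by the "DO a0" loop above
  s_local = 0.0
  do j = jlo, jhi
    s_local = s_local + sum(a1(:,j) + a2(:,j))
  end do

  ! one collective plays the role of dd_postloop_xchng: combine the
  ! partial sums and leave the result on every participating processor
  call MPI_ALLREDUCE(s_local, s, 1, MPI_REAL, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)

  if (me == 0) print *, 's =', s
  call MPI_FINALIZE(ierr)
end program partial_sum_sketch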
At CTC we have examined the communication pattern involved: it is a tree of log P levels over the P participating processors. The MPI communications are accomplished with blocking sends and receives.
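A tree combine of that shape can be sketched with blocking point-to-point MPI as follows. This is only an illustration of the pattern described above, not xHPF's actual code; it assumes the number of processors is a power of two and, for brevity, redistributes the final value with MPI_BCAST.

! Illustrative sketch: a log2(P)-level binomial-tree sum over blocking
! sends and receives, followed by a broadcast of the result.
subroutine tree_sum(s_local, s, comm)
  use mpi
  implicit none
  real,    intent(in)  :: s_local
  real,    intent(out) :: s
  integer, intent(in)  :: comm
  integer :: me, np, step, ierr
  integer :: status(MPI_STATUS_SIZE)
  real    :: partner_val

  call MPI_COMM_RANK(comm, me, ierr)
  call MPI_COMM_SIZE(comm, np, ierr)

  s = s_local
  step = 1
  do while (step < np)                  ! log2(np) levels, np a power of two
    if (mod(me, 2*step) == 0) then      ! receive a partner's partial sum
      call MPI_RECV(partner_val, 1, MPI_REAL, me + step, 0, comm, &
                    status, ierr)
      s = s + partner_val
    else                                ! send the partial sum and drop out
      call MPI_SEND(s, 1, MPI_REAL, me - step, 0, comm, ierr)
      exit
    end if
    step = 2*step
  end do

  ! convey the scalar result to all participating processors
  call MPI_BCAST(s, 1, MPI_REAL, 0, comm, ierr)
end subroutine tree_sum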
PGI pghpf instantiates a temporary array, a1$a, to hold the elemental summation, and then invokes its own run-time system summation-reduction routine:

      call pghpf_localize_bounds(a1$a$d1(a1$a$dp1),1,1,30,1,i$$l,i$$u)
      call pghpf_localize_bounds(a1$a$d1(a1$a$dp1),2,1,40,1,i$$l1,i$$u1)
! forall (i$i=i$$l1:i$$u1:1, i$i1=i$$l:i$$u:1) a1$a((u$$b-l$$b+1)*(
! +i$i-l$$b1)+i$i1-l$$b+a1$a$p) = a1((u$$b2-l$$b2+1)*(i$i-l$$b3)+i$i1
! +-l$$b2+a1$p) + a2((u$$b4-l$$b4+1)*(i$i-l$$b5)+i$i1-l$$b4+a2$p)
      do i$i = i$$l1, i$$u1
      do i$i1 = i$$l, i$$u
      a1$a((u$$b-l$$b+1)*(i$i-l$$b1)+i$i1-l$$b+a1$a$p) = a1((u$$b2
     +-l$$b2+1)*(i$i-l$$b3)+i$i1-l$$b2+a1$p) + a2((u$$b4-l$$b4+1)*(i$i-
     +l$$b5)+i$i1-l$$b4+a2$p)
      enddo
      enddo
      call pghpf_sums(a1$a$r,a1$a(a1$a$p),.true.,27,a1$a$d1(a1$a$dp1),19
     +)
      s = a1$a$r

This can be seen in the context of the full saved Fortran file.
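In outline, the pghpf strategy amounts to the ordinary Fortran + MPI sketch below: form a temporary array holding the elemental sum of the local sections, then hand it to a single reduction step. The argument shapes and the use of MPI_ALLREDUCE in place of the pghpf_sums run-time routine are assumptions made for illustration.

! Illustrative sketch of the pghpf approach; not the pghpf run time.
subroutine pghpf_style_sum(a1_local, a2_local, s, comm)
  use mpi
  implicit none
  real,    intent(in)  :: a1_local(:,:), a2_local(:,:)  ! this node's sections
  real,    intent(out) :: s
  integer, intent(in)  :: comm
  real, allocatable :: tmp(:,:)
  real    :: sum_local
  integer :: ierr

  ! the temporary plays the role of a1$a in the listing above
  allocate(tmp(size(a1_local,1), size(a1_local,2)))
  tmp = a1_local + a2_local
  sum_local = sum(tmp)          ! local contribution to the reduction
  deallocate(tmp)

  ! a generic MPI collective stands in for pghpf_sums
  call MPI_ALLREDUCE(sum_local, s, 1, MPI_REAL, MPI_SUM, comm, ierr)
end subroutine pghpf_style_sum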
The IBM XL HPF strategy for this case is similar to that of xHPF in that it uses local scalar variables (here the compiler-generated ScRed_40, ScRed_41, ScRed_42, ScRed_43, and SCALAR_29) for the per-node partial sums. Note how the elemental-summation loop has been "unrolled" four times at this high level of compiler optimization. XL HPF also resembles xHPF in that it accomplishes the elemental array summation in the same do i_10 / do i_10 loop pair that computes the local sum. It then invokes its own run-time system routine, _xlhpf_reduce_sum(...), to complete the reduction from each node's local scalar sum:
      s = 0.
      SCALAR_29 = 0.
      ScRed_40 = dble(SCALAR_29)
      ScRed_41 = dble(0)
      ScRed_42 = dble(0)
      ScRed_43 = dble(0)
C 1585-501 Original Source Line 6
      do i_9=iown_l_19,MIN0(iown_u_20,40),1
C 1585-501 Original Source Line 6
        do i_10=iown_l_21,MIN0(iown_u_22,30) - 3,4
          ScRed_40 = ScRed_40 + dble(a1_35(i_10,i_9) + a2_36(i_10,i_9))
          ScRed_41 = ScRed_41 + dble(a1_35(i_10 + 1,i_9) + a2_36(i_10 +
     & 1,i_9))
          ScRed_42 = ScRed_42 + dble(a1_35(i_10 + 2,i_9) + a2_36(i_10 +
     & 2,i_9))
          ScRed_43 = ScRed_43 + dble(a1_35(i_10 + 3,i_9) + a2_36(i_10 +
     & 3,i_9))
        end do
C 1585-501 Original Source Line 6
        do i_10=i_10,MIN0(iown_u_22,30),1
          ScRed_40 = ScRed_40 + dble(a1_35(i_10,i_9) + a2_36(i_10,i_9))
        end do
      end do
      SCALAR_29 = real(ScRed_40 + ScRed_41 + ScRed_42 + ScRed_43)
      Recv_index_32(1) = (-2)
      Recv_index_32(2) = (-2)
      Send_index_33(1) = 0
      Send_index_33(2) = 1
      DS_SAS_34(1) = 0
      DS_SAS_34(2) = MIN0(29 / D_18(1),PGB_14(1) - 1)
      DS_SAS_34(3) = 1
      DS_SAS_34(4) = 0
      DS_SAS_34(5) = MIN0(39 / D_18(3),PGB_14(2) - 1)
      DS_SAS_34(6) = 1
      call _xlhpf_reduce_sum(9,SCALAR_29,s,PG_16,2,Send_index_33,DS_SAS
     &_34,Recv_index_32)

Complete details can be seen in the pseudo-Fortran listing.
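Stripped of the compiler-generated bookkeeping, the local part of this strategy is the pattern sketched below: four independent double-precision accumulators, a stride-4 unrolled main loop, a remainder loop, and a final combine into one scalar that is later reduced across nodes. The bounds nlo/nhi and mlo/mhi stand in for the iown_* ownership bounds and are assumptions for illustration.

! Illustrative sketch of the XL HPF local accumulation; the global
! combine (_xlhpf_reduce_sum in the listing) is not shown here.
subroutine unrolled_partial_sum(a1, a2, nlo, nhi, mlo, mhi, partial)
  implicit none
  real,    intent(in)  :: a1(:,:), a2(:,:)
  integer, intent(in)  :: nlo, nhi, mlo, mhi
  real,    intent(out) :: partial
  double precision :: acc0, acc1, acc2, acc3
  integer :: i, j

  acc0 = 0.d0; acc1 = 0.d0; acc2 = 0.d0; acc3 = 0.d0
  do j = mlo, mhi
    do i = nlo, nhi - 3, 4                   ! main loop, unrolled by four
      acc0 = acc0 + dble(a1(i,   j) + a2(i,   j))
      acc1 = acc1 + dble(a1(i+1, j) + a2(i+1, j))
      acc2 = acc2 + dble(a1(i+2, j) + a2(i+2, j))
      acc3 = acc3 + dble(a1(i+3, j) + a2(i+3, j))
    end do
    do i = i, nhi                            ! remainder iterations, if any
      acc0 = acc0 + dble(a1(i, j) + a2(i, j))
    end do
  end do
  ! collapse the four accumulators into the node-local scalar sum
  partial = real(acc0 + acc1 + acc2 + acc3)
end subroutine unrolled_partial_sum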
At CTC we have observed that the reduction communication is accomplished with an MPI collective REDUCE_ALL operation followed by an MPI broadcast of the scalar value.
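The observed pattern corresponds to the two MPI calls sketched below. MPI itself has no routine literally named REDUCE_ALL; the sketch reads the observation as an MPI_REDUCE to a root followed by an MPI_BCAST, which is an assumption (MPI_ALLREDUCE would fuse the two steps into one call).

! Illustrative sketch of a reduce-then-broadcast of the scalar result.
subroutine reduce_then_bcast(s_local, s, comm)
  use mpi
  implicit none
  real,    intent(in)  :: s_local
  real,    intent(out) :: s
  integer, intent(in)  :: comm
  integer :: ierr

  ! combine every node's partial sum onto rank 0 ...
  call MPI_REDUCE(s_local, s, 1, MPI_REAL, MPI_SUM, 0, comm, ierr)
  ! ... then broadcast the combined scalar back to all nodes
  call MPI_BCAST(s, 1, MPI_REAL, 0, comm, ierr)
end subroutine reduce_then_bcast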