For a 256 trace, 1024 time point section, the program in this directory
ran at only 1/15'th of the rated peak speed of the hypercube.  These exercises
explore ways to increase performance.

1)  Static Load Balancing:

    As written every processor does the same amount of computing and I/O
    in the migration program.  However much of this is done on dummy zero
    vectors.  In reality processor 0 does more useful work than processor 1
    which in turn does more work than processor 2 and so on.  By tuning
    the number of tau steps in each processor we can more evenly distribute
    the work.  a) Give a formula that each node can use to determine its
    own value of "mtau" based on its node number.  b) Will this redistribution
    decrease the time needed to migrate a section on the cube?  By how much?  

2)  Dynamic Load Balancing A:

    An alternative to fixing the number of tau steps per node at initialization
    is to let this number start at 1 and once every node has input increase
    it to 2, then 3, etc.  This is a dynamic load balancing and requires
    shifting the grid sideways across the nodes every so often.  a)  How
    much of the grid shifting can be "hidden" in the data shifting that
    is already done?  b) If we treat the cost of a data shift to be
    about the same as one extrapolation,  does the dynamic load balancing
    decrease the time needed to migrate a section on the cube?  Consider
    the two special cases  nproc << nt  and  nproc = nt.

3)  Dynamic Load Balancing B:

    The load balancing scheme suggested above required shifting the differencing
    grid between processors in addition to shifting the data.  An alternative
    is to modify our linear configuration of the hypercube so that it forms
    as ring with the last processor adjacent to the first.  This lets us spiral
    the computational grid around and around the hypercube.  A clear drawback
    of this rather elegant mapping is that each processor's communication
    grows in direct proportion to the amount of computation per node.  If
    we were able to overlap computation and communication this would not be
    so much of a problem.  The NCUBE hypercube does permit asynchronous
    I/O simultaneously to each of a processor's neighbors.  This suggests we
    spiral our grid in a different direction each time we pass through a node
    in order to take advantage of the parallel I/O.  Design a suitable mapping
    for this.  (Hint: circular bit shifts.)

4)  Increased Tau Steps:

    In industry practice it is rare to migrate a section with a tau step
    equal to the input sampling interval.   Usually a coarser step is used,
    say 10 times the input sampling interval.  This reduces the number of
    extrapolation steps by approximately the same factor.  Can this be
    used to reduce the total time needed to migrate a section on the hypercube?
    How?

5)  Inner Loop Optimization:

    The parallelism we've exploited is in the outer t-tau loops.  The inner
    loop over x, the spatial dimension, is still recursive as it solves
    a tridiagonal system using the standard L-D-U method.  One alternative
    is due to Wang which allows multiple nodes to work on solving a tridiagonal
    system.  For two nodes, pictorially it transforms:


    x x                            x       x
    x x x                            x     x
      x x x                            x   x
        x x x          TO                x x     
          x x x                            y     y
            x x x                          x x   x
              x x x                        x   x x
                x x                        y     y

        
   with the first processor handling the top half and the second the bottom
   half.  This results in a reduced banded system, indicated by the y's, with
   one row per processor.  This small system is then solved by L-D-U and
   back substituted, again with each node back sustituting in its half of the
   matrix.   a) Is this more computation than the L-D-U method?  Does it
   accomplish solution of the tridiagonal system faster?


   An alternative to parallelizing the inner loop is to interchange loops
   to make the x-loop an outer loop, that is solve several tridiagonal
   systems simultaneously.  Will this allow us to bring more processors
   to bear on performing a migration?   Will it get the migration done
   any faster?