For a 256 trace, 1024 time point section, the program in this directory ran at only 1/15'th of the rated peak speed of the hypercube. These exercises explore ways to increase performance. 1) Static Load Balancing: As written every processor does the same amount of computing and I/O in the migration program. However much of this is done on dummy zero vectors. In reality processor 0 does more useful work than processor 1 which in turn does more work than processor 2 and so on. By tuning the number of tau steps in each processor we can more evenly distribute the work. a) Give a formula that each node can use to determine its own value of "mtau" based on its node number. b) Will this redistribution decrease the time needed to migrate a section on the cube? By how much? 2) Dynamic Load Balancing A: An alternative to fixing the number of tau steps per node at initialization is to let this number start at 1 and once every node has input increase it to 2, then 3, etc. This is a dynamic load balancing and requires shifting the grid sideways across the nodes every so often. a) How much of the grid shifting can be "hidden" in the data shifting that is already done? b) If we treat the cost of a data shift to be about the same as one extrapolation, does the dynamic load balancing decrease the time needed to migrate a section on the cube? Consider the two special cases nproc << nt and nproc = nt. 3) Dynamic Load Balancing B: The load balancing scheme suggested above required shifting the differencing grid between processors in addition to shifting the data. An alternative is to modify our linear configuration of the hypercube so that it forms as ring with the last processor adjacent to the first. This lets us spiral the computational grid around and around the hypercube. A clear drawback of this rather elegant mapping is that each processor's communication grows in direct proportion to the amount of computation per node. If we were able to overlap computation and communication this would not be so much of a problem. The NCUBE hypercube does permit asynchronous I/O simultaneously to each of a processor's neighbors. This suggests we spiral our grid in a different direction each time we pass through a node in order to take advantage of the parallel I/O. Design a suitable mapping for this. (Hint: circular bit shifts.) 4) Increased Tau Steps: In industry practice it is rare to migrate a section with a tau step equal to the input sampling interval. Usually a coarser step is used, say 10 times the input sampling interval. This reduces the number of extrapolation steps by approximately the same factor. Can this be used to reduce the total time needed to migrate a section on the hypercube? How? 5) Inner Loop Optimization: The parallelism we've exploited is in the outer t-tau loops. The inner loop over x, the spatial dimension, is still recursive as it solves a tridiagonal system using the standard L-D-U method. One alternative is due to Wang which allows multiple nodes to work on solving a tridiagonal system. For two nodes, pictorially it transforms: x x x x x x x x x x x x x x x x x TO x x x x x y y x x x x x x x x x x x x x x y y with the first processor handling the top half and the second the bottom half. This results in a reduced banded system, indicated by the y's, with one row per processor. This small system is then solved by L-D-U and back substituted, again with each node back sustituting in its half of the matrix. a) Is this more computation than the L-D-U method? Does it accomplish solution of the tridiagonal system faster? An alternative to parallelizing the inner loop is to interchange loops to make the x-loop an outer loop, that is solve several tridiagonal systems simultaneously. Will this allow us to bring more processors to bear on performing a migration? Will it get the migration done any faster?