In this chapter, we have focused on the software engineering question of developing a parallel algorithm design from a given problem specification. We have assumed some familiarity with program design---from, for example, previous study in sequential programming. A classic article by Parnas and Clements [222] describes the benefits (and difficulties) of a rational design process. The chapter notes in Chapters 3 and 4 provide additional references to material on the related topics of performance analysis and modularity.
Relatively few authors have addressed the particular problems that arise when designing algorithms for realistic scalable parallel computers. Numerous books discuss parallel algorithm design in the context of the idealized PRAM model; see, for example, Akl [8], Gibbons and Reitter [119], and JáJá [157]. However, these techniques are for the most part not directly applicable to multicomputers. The books by Fox et al. [111,113] and Kumar et al. [181] provide relevant material. Carriero and Gelernter [48] give an introduction to parallel program design in the Linda language. They distinguish between agenda, result, and specialist parallelism, which can be thought of as domain decomposition and two different forms of functional parallelism, respectively. See also the references in Chapter 12.
The mapping problem has received considerable attention in the computer science and scientific computing literature. Bokhari shows that it is NP -complete [39]. The recursive coordinate bisection algorithm is due to Berger and Bokhari [33], and the unbalanced recursive bisection algorithm is due to Jones and Plassmann [161]. A related algorithm is the recursive spectral bisection algorithm of Pothen, Simon, and Liou [230], which uses connectivity information to identify partitions of unstructured meshes with low communication requirements, at the cost of an eigenvalue computation. This algorithm has proven very effective, although more expensive than the other recursive algorithms. Simon [258] and Williams [293] compare different algorithms for partitioning unstructured meshes, including coordinate bisection, graph bisection, and spectral bisection. Barnard and Simon [28] describe a less expensive multilevel version of spectral bisection. The mesh in Figure 2.9 is generated using the finite octree technique of Shephard and Georges [256]. Fox et al. [111] and Nicol and Saltz [213] describe the use of cyclic decompositions. Lin and Keller [190] describe a local algorithm. The local algorithm described in the text is termed a receiver-initiated strategy. In an alternative sender-initiated strategy, workers with excess work send it to other workers. Tantawi and Towsley [279] compare sender-initiated and receiver-initiated strategies and show that the former gives better performance if workers are often idle and that the latter performs better when load is heavy. Other relevant papers include those by Berman and Snyder [34]; Chowdhury [61]; Cybenko [68]; Hac [131]; Heath, Rosenberg, and Smith [141]; Kumar, Grama, and Rao [180]; Lo [191]; and Sadayappan and Ercal [250]. Dijkstra, Feijen, and Gasteren [81], Rokusawa et al. [246], and Kumar et al. [179] describe distributed termination-detection algorithms.
Real atmosphere models are of course more complex than the system considered in Section 2.6. Washington and Parkinson [292] provide a good introduction to the numerical methods and algorithms used in climate modeling on sequential computers. A workshop held at the European Center for Medium-range Weather Forecasting surveyed issues involved in executing weather and climate models on parallel computers [155]. The parallel version of the Community Climate Model is described by Drake et al. [86]. Michalakes [206] analyzes load imbalances in climate models and Foster and Toonen [109] describe load-balancing algorithms.
Banerjee [26] describes parallel algorithms for VLSI design. Wimer et al. [294] and Arvindam et al. [17] describe branch-and-bound search algorithms and domain-specific optimizations that can improve performance on floorplanning problems. Reinefeld and Schnecke [243] describe the algorithm variant in which workers redundantly expand several tree levels before selecting nodes for local expansion. Kumar et al. [179,181,239] provide a wealth of material on the design, implementation, and analysis of parallel search algorithms. Quinn [234] also examines branch-and-bound search and describes and analyzes the performance of four different load balancing strategies. For a general introduction to search algorithms, see Nilsson [214], Pearl [225], and Kumar and Kanal [164].
Hehre et al. [142] provide an introduction to ab initio quantum chemistry. Feyereisen and Kendall [96] describe a replicated data algorithm for the Fock matrix construction problem. Colvin et al. [62] describe an algorithm based on domain decomposition techniques. An algorithm that uses distributed data structures and a centralized task scheduler is described by Harrison et al. [108,135].
Here is a Web Tour providing access to additional information on parallel program design and software engineering.
© Copyright 1995 by Ian Foster