As David May has pointed out, it is not enough to bolt fast communications (i.e. high bandwidth) onto fast individual nodes. Parallel machine designers, and parallel programmers, need constantly to remember that the aim is to keep *ALL* the processors busy doing useful compute, rather than to optimise single-node performance per se.
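One way to make the point concrete (the notation and the worked numbers here are mine, not May's): let t_i be the useful busy time of processor i out of p, so the run finishes at time max_i t_i. The efficiency of the run is then

    E  =  (t_1 + ... + t_p) / (p * max_i t_i)

and a single straggler dominates no matter how fast each node is: with p = 100, if 99 processors are busy for 1 unit and one is busy for 2, then E = 101/200, roughly 51% -- half the machine's compute wasted by one poorly balanced processor.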
Challenges remain at several levels:

- At the architecture level: What is a "well balanced" (parallel) machine? Can the industry be persuaded to build some?

- At the programming model level: Is there a single "unifying" model that is good enough? Or do we want a (small) plurality, "horses for courses"? And how best should performance issues be modelled?

- At the language level: Can the level of expression be raised to match more closely the "natural" parallelism in problems? Should parallelism indeed be expressed, or generated? How, and to what extent, should performance issues show through to the programmer?

- At the compiler and run-time level: Does the compiler need to extract parallelism? If so, how and when? How do we map effectively, i.e. utilise the machine's resources so as to maximise the total useful compute? (A toy illustration of how much the map alone can matter follows this list.)

- At the application level: Is a "generative" approach, i.e. application generators for particular domains, the way forward? How do we build these? Will they "guarantee" good performance?
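To make the mapping question concrete, here is a minimal sketch (mine, not from the discussion above; the task costs are invented, chosen to mimic a triangular loop nest) comparing a block and a cyclic mapping of the same tasks onto the same processors:

    /* Toy illustration of the mapping problem: the same 16 tasks
     * mapped two ways onto 4 processors.  Task i costs i+1 units,
     * an invented cost profile (linearly growing, as in a
     * triangular loop nest). */
    #include <stdio.h>

    #define NTASKS 16
    #define NPROCS 4

    int main(void) {
        double cost[NTASKS], busy[NPROCS];
        int i;

        for (i = 0; i < NTASKS; i++)
            cost[i] = i + 1;                 /* task i costs i+1 units */

        /* Block mapping: tasks 0..3 -> proc 0, 4..7 -> proc 1, ... */
        for (i = 0; i < NPROCS; i++) busy[i] = 0.0;
        for (i = 0; i < NTASKS; i++)
            busy[i / (NTASKS / NPROCS)] += cost[i];
        printf("block : ");
        for (i = 0; i < NPROCS; i++) printf("%5.1f ", busy[i]);
        printf("\n");

        /* Cyclic mapping: task i -> proc i mod p, spreading the heavy tail */
        for (i = 0; i < NPROCS; i++) busy[i] = 0.0;
        for (i = 0; i < NTASKS; i++)
            busy[i % NPROCS] += cost[i];
        printf("cyclic: ");
        for (i = 0; i < NPROCS; i++) printf("%5.1f ", busy[i]);
        printf("\n");
        return 0;
    }

With these costs the block map leaves the run waiting 58 units on the last processor while the first does only 10 units of work; the cyclic map's worst case is 40. Same program, same machine, same total work -- the map alone determines how much of the machine's compute is useful.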
At root, the problem is one of expression and mapping, from applications to (parallel) programs to target architectures. The first (the applications) are "given"; the second and third can be designed to suit -- and should be, if the compiler and system software are to have a chance!
Finally, and more provocatively: it is no longer any use to think of matching the application (or algorithm) to the architecture. With architectures so varied, and the industry changing so rapidly, any such hand-crafted match is doomed to a short life.
Chris Wadsworth (cpw@inf.rl.ac.uk)