Crisis in HPC Discussion - Peter Welch, UKC

Newsgroups: uk.org.epsrc.hpc.discussion
From: P.H.Welch@ukc.ac.uk (phw)
Subject: Superscalar considered harmful
Organization: University of Kent at Canterbury, UK.
Date: Thu, 05 Oct 95 21:41:27 GMT
Message-ID: <1547@larch.ukc.ac.uk>

In article 36 of uk.org.epsrc.hpc.discussion, Cliff Addison
(cliff@liverpool.ac.uk) writes:
| >                                                   ... Its really got
| >  nothing to do with MPP aspects of T3D, its to do with memory
| >  hierarchies.
| 
| This is a point that needs to be repeated and repeated. Sparse matrix
| problems, for instance solving linear systems via conjugate gradients,
| can completely thrash a lot of superscalar superpipelined processors
| because there is a lot of indirect addressing that can cause pipeline
| stalls and inefficient cache utilisation (1 word out of a cache line
| used etc. etc.).
| 
| A great deal of attention has been paid by developers of numerical
| software to exploit locality of reference and to get reasonable levels
| of performance from a range of processors in an easily configured
| manner. A lot of this work involves fairly significant changes to the
| underlying algorithms and data structures employed. Excellent progress
| has been made in some problem areas ...

It's a real shame we keep having to turn our algorithms inside-out to placate the idiosyncrasies of the latest hardware architecture. We had to do this for vector processors. Further back, we had to do just what is being asked above in the days when `main' memory was very scarce and the bulk of an application's data had to live on disk - the same memory hierarchy problem, scaled down an order of magnitude or two. We shouldn't have thrown those codes away when DRAM became cheap enough to afford in bulk and we could re-write them, with relief, according to the logic of their application and not according to the needs of efficient (virtual) memory management. Looks like we are having to re-invent them - or something fairly similar - and it doesn't look too pretty. What impact do these "fairly significant changes" have on the clarity (and verifiability) of the resulting systems?
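To make the indirect-addressing point concrete, here is a minimal sketch (plain C, all names invented) of the compressed-sparse-row matrix-vector product at the heart of a conjugate gradient iteration. Every load of x goes through a column index, so consecutive accesses can land on unrelated cache lines and use one word of each line fetched, just as Cliff describes:

    /* Sketch only: CSR (compressed sparse row) y = A*x, the kernel inside
     * a conjugate gradient iteration.  All names are invented.            */
    #include <stddef.h>

    void csr_matvec(size_t nrows,
                    const size_t *row_start,   /* nrows+1 entries          */
                    const size_t *col_index,   /* one entry per non-zero   */
                    const double *value,       /* one entry per non-zero   */
                    const double *x,           /* dense input vector       */
                    double       *y)           /* dense output vector      */
    {
        for (size_t i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (size_t k = row_start[i]; k < row_start[i + 1]; k++) {
                /* x is indexed indirectly through col_index[k]: successive
                 * k can touch widely separated parts of x, so each access
                 * may pull in a whole cache line and use only one word.   */
                sum += value[k] * x[col_index[k]];
            }
            y[i] = sum;
        }
    }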

Maybe there is something wrong with the concept of superscalar superpipelined processors? What they do is extract a very limited amount of concurrency at the instruction level (2 to 5 instructions per cycle, if you are lucky with their interdependencies). It's difficult to find enough inter-instruction independencies to keep the pipeline full for a decent percentage of the time, and as soon as something goes wrong (like a cache miss), the thing stalls and you have to flush the pipeline and start again. On a uni-processor, it seems that this happens far too often - so we are being asked to write code that tries to make it happen *very* much less often. Tricky! On a multi-processor, cache coherency problems may force many more cache flushes and, hence, many more cache misses - what do we do about that?
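As a toy sketch of the problem (nothing to do with any particular compiler or chip - the names are invented): a dot-product reduction written the obvious way is one long chain of dependent additions, and splitting the accumulator by hand is exactly the kind of contortion the pipeline pushes us towards:

    /* Toy example of inter-instruction (in)dependence; names are invented. */
    #include <stddef.h>

    /* One accumulator: every add depends on the previous one, so a
     * superscalar core can rarely overlap them.                            */
    double dot_chain(const double *a, const double *b, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* Four accumulators: the adds within an iteration are independent of
     * each other, giving the issue logic something to overlap - at the
     * price of rewriting the code to suit the pipeline.                    */
    double dot_split(const double *a, const double *b, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)          /* leftover elements */
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }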

Instead of spending the silicon real-estate on a superscalar pipeline (say n deep), how about spending it on n scalar processors side-by-side, all connected on-chip (through a sufficiently low-latency and high-bandwidth cross-bar) to each other, the on-chip cache, external memory and external communications? All we have to find are lots of independent instruction streams (where "lots" means quite a bit greater than n) to direct at the n on-chip processors. We also need a very fast way of (context) switching to a spare instruction stream when one of them stalls, bearing in mind that the cost of a stall on a non-pipelined scalar processor is not heavy.
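A rough software analogue (only a sketch - the thread count and names are invented, with the operating system's scheduler standing in for the hardware switching): create far more threads than there are processors and let a processor pick up another runnable stream whenever the current one stalls. The hardware version would simply make that switch very much cheaper:

    /* Sketch: many independent instruction streams multiplexed over a few
     * processors.  The OS scheduler stands in for the hardware
     * stream-switching described above; all names are invented.           */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_STREAMS 64        /* "lots": quite a bit greater than n    */

    static void *stream(void *arg)
    {
        long id = (long)arg;
        double s = 0.0;

        /* Some private, independent work for this stream.                 */
        for (long i = 1; i <= 1000000; i++)
            s += 1.0 / (double)(i + id);

        printf("stream %2ld finished, partial sum %f\n", id, s);
        return NULL;
    }

    int main(void)
    {
        pthread_t streams[NUM_STREAMS];

        /* Direct NUM_STREAMS independent streams at however many
         * processors the machine has; when one stream stalls, a processor
         * simply picks up another runnable stream.                        */
        for (long id = 0; id < NUM_STREAMS; id++)
            pthread_create(&streams[id], NULL, stream, (void *)id);

        for (long id = 0; id < NUM_STREAMS; id++)
            pthread_join(streams[id], NULL);

        return 0;
    }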

All this seems ever so do-able! One pay-off is that we won't have to distort our codes to maximise cache hits, because the cost of a cache miss on just *one* of the *scalar* processors will not be as severe as a cache miss on the single superscalar pipeline. Secondly, not only will fine-grained parallelism not be detrimental to the performance of the whole parallel machine, it will be essential to the performance of each (multi-processor) node. Joy unbounded! Unless you are stuck with lots of chunky serial Fortran or C code, in which case you are in trouble!! But you can't keep on running those codes for ever ...

In article 37, Lyndon comes back with:

|                          ...  Can we disentangle the performance and
| software issues in using the parallel machines from the performance
| and software issues in using the newer faster commodity microprocessors
| (superscalar, superpipeline, cache, etc)? I do hope so, in fact I believe
| we must, ...

But I don't think we will be able to. Single-chip processors are nowadays fairly multi-processor internally, and I'm sure that trend will get stronger. I don't think there will be such a thing as a "serial processor" whose performance needs to be optimised independently of parallel considerations. Parallelism will be everywhere, integrated at all levels of silicon. We will have to integrate it at all levels of software to use that silicon efficiently ... but that strikes me as far easier than trying to devise serial algorithms that make good use of cache hierarchies!

As Nick Maclaren says in article 38:

| As far as the caching problems go, these are EXACTLY the same for a single
| microprocessor and a parallel system, with different scaling parameters.
| The algorithmic constraints are almost identical.

Summarising:

 - superscalar superpipelined processors buy a limited amount of
   instruction-level concurrency at the price of forcing us to distort
   our algorithms and data structures to keep their pipelines and
   caches happy;

 - spending the same silicon on n simple scalar processors, fed by many
   more than n independent instruction streams and able to switch
   streams cheaply when one stalls, removes much of that pressure and
   makes fine-grained parallelism an asset rather than a liability;

 - parallelism is coming at every level of the silicon, so we had
   better integrate it at every level of the software.

Well, it's something to think about anyway.

Peter Welch.

