Crisis in HPC Discussion - Peter Welch, UKC

Newsgroups: uk.org.epsrc.hpc.discussion
From: P.H.Welch@ukc.ac.uk (phw)
Subject: Re: Crisis in HPC - Conclusions
Organization: University of Kent at Canterbury, UK.
Date: Thu, 05 Oct 95 11:41:20 GMT
Message-ID: <171@cypress.ukc.ac.uk>

In article 35 of uk.org.epsrc.hpc.discussion, Lyndon Clarke
(lyndon@epcc.ed.ac.uk) writes:

> * Efficiency of T3D - it's my experience that the performance of T3D programs
>   is determined by the efficiency of the single processor performance. People
>   are reporting efficiencies in the region of 17%, and this is more or less
>   just the single node efficiencies that codes are seeing. It's really got
>   nothing to do with MPP aspects of T3D, it's to do with memory hierarchies.

The NAS Parallel Benchmark reports show T3D nodes (up to 512 of them) getting around 41% of peak on the "Embarrassingly Parallel" benchmark, so we know that better single-node efficiencies are possible.

Lyndon, and others later, stress that the low efficiencies are most likely caused by inefficient cache utilisation on single nodes, and that we really need to separate this problem from problems caused by MPP.
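
To make the cache point concrete, here is a minimal C sketch (my own illustration, not code from the workshop) of the single-node effect being described: the same arithmetic, traversed two ways, with very different cache behaviour:

    /* Illustration only: the same O(N*N) sum, traversed two ways.
       On a node with a small cache, the column-order loop misses
       cache on almost every access; the row-order loop mostly hits. */
    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    int main(void)
    {
        double sum = 0.0;
        int i, j;

        /* cache-friendly: walks memory contiguously (C is row-major) */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];

        /* cache-hostile: strides N doubles between successive accesses */
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }

Timing the two loops separately on a single node would give a direct measure of how much of a reported efficiency figure is cache and nothing else.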

This is a very good idea. The NAS "Embarrassingly Parallel" benchmark needs renaming to the "Embarrassingly Parallel and Embarrassingly Cache-Hit" benchmark, and they need to come up with an "Embarrassingly Parallel but So-So Cache-Hit" benchmark. If the latter comes up with a 17% efficiency figure, we will know where the current bottleneck is!

[Aside: the embarrassingly low 41% efficiency from the "Embarrassingly Parallel and Embarrassingly Cache-Hit" benchmark is presumably caused by a not-very-good memory hierarchy in the T3D nodes - can anyone confirm this?]

However, we still need to discover whether the cache-miss single-node problem may be masking a real MPP bottleneck - say at 17.1%? To check this out, we need a "So-So Parallel but Embarrassingly Cache-Hit" benchmark (to reflect clever re-codings of memory-hierarchy-aware algorithms). Will that then recover 41% from 512 nodes?
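
A crude way to see why the two effects must be separated: to first order, overall efficiency is the product of single-node efficiency and MPP efficiency, so the same 17% is consistent with two quite different diagnoses. A back-of-envelope sketch (the factorisation is a simplification; the numbers are just the figures quoted above):

    /* Toy model: overall_eff ~= node_eff * mpp_eff (a simplification).
       Either nodes really run at ~17% and MPP is near-perfect, or a
       real MPP bottleneck is hiding behind the cache problem. */
    #include <stdio.h>

    int main(void)
    {
        double node_eff = 0.41;  /* best single-node figure (NAS EP) */
        double overall  = 0.17;  /* typical reported T3D efficiency  */

        printf("MPP efficiency implied by 41%% nodes: %.0f%%\n",
               100.0 * overall / node_eff);
        return 0;
    }

Until a "So-So Parallel but Embarrassingly Cache-Hit" benchmark is run, we cannot tell which diagnosis is the right one.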

The "So-So Parallelism" in the above needs to reflect algorithms that are not constrained to coarse-grained parallelism. I really want to get back to the fine-grained parallelism that transputers used to support, and from which efficiencies considerably greater than 50% could be achieved. But will I then get clobbered by cache-incoherency problems resulting from MPP, which clobber the otherwise beautiful cache-hit algorithms we are asked to develop for single nodes? And, of course, will I further get clobbered by long communication start-up latencies and slow context switching on single nodes? None of these problems existed for transputer networks.
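
On the start-up latency point: the usual first-order cost model for an n-byte message is T(n) = t0 + n/B, and the t0 term is exactly what penalises fine grains. A sketch with hypothetical round numbers (the t0 and B below are assumptions for illustration, not measured T3D or transputer figures):

    /* First-order message-cost model: T(n) = t0 + n/B.
       The start-up term t0 dominates for small n, so fine-grained
       communication sees only a tiny fraction of the bandwidth B.
       Both constants are hypothetical round figures. */
    #include <stdio.h>

    int main(void)
    {
        double t0 = 50e-6;   /* start-up latency: 50 us (assumed)        */
        double B  = 100e6;   /* asymptotic bandwidth: 100 MB/s (assumed) */
        long   n;

        for (n = 8; n <= 1L << 20; n *= 16) {
            double t = t0 + (double)n / B;
            printf("%8ld bytes: %9.1f us, effective %6.2f MB/s\n",
                   n, t * 1e6, (double)n / t / 1e6);
        }
        return 0;
    }

With these figures an 8-byte message sees well under 1% of the link bandwidth; transputer links, with start-up costs of the order of a microsecond, sat much closer to the n/B line, which is part of why fine grains were viable there.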

Our immediate problem for MPP seems to be poor cache utilisation on single nodes - and that seems to be a non-MPP-specific problem. But the MPP-specific problems are still there. User performance on the T3D seems to range from 17% all the way down to 1%. The feeling at the `Crisis' workshop was that this was *not* the fault of the users!

So, there are two questions:

  (1) Will the memory-hierarchy-aware re-codings that recover 41% on a single node still deliver 41% across all 512 nodes?

  (2) Can fine-grained parallelism be supported without being clobbered by cache incoherency, communication start-up latencies and context-switch times?

If the answers are negative, we have a problem ...

Peter Welch.

