From: mccalpin@frakir.engr.sgi.com (John McCalpin)
Newsgroups: comp.parallel
Subject: Re: superlinear question
Date: 26 Oct 1998 20:13:59 GMT
Organization: Silicon Graphics, Inc.
Approved: bigrigg@cs.cmu.edu
Message-Id: <712l67$h4n$1@encore.ece.cmu.edu>
Originator: bigrigg@ece.cmu.edu


In article <709cvh$7ij$1@encore.ece.cmu.edu>,
@PUB:EMAIL.COM EMDIR <PBASUKIAG@alpha7.curtin.edu.au> wrote:
>Is there anyone who can give us the SIMPLEST algorithm and codes to
>achieve SUPERLINEAR speedup.  I am using MPI (Message Passing
>Interface) and C to study the parallel algorithm.  Any help will be
>appreciated.  Please send them straight to my e-mail

It is odd that so many people in this forum overlook the most
common cause of superlinear speedup.

We have many codes that show superlinear speedup for
fixed-size problems on the Origin2000.  This occurs whenever
the benefit of the increased aggregate cache size outweighs
the overhead of the parallelization.

In some cases, such as Version 2 of the NAS Parallel Benchmarks,
the single-processor version of the code could have been blocked
for cache, but (for whatever reason) was not.  We don't see
superlinear speedups in the Version 1 NAS Parallel Benchmarks
because we have blocked the uniprocessor versions effectively.

In other cases, blocking is not possible, and superlinear
speedup is then a fairly common occurrence, not at all
pathological.

For an example I would probably choose an unpreconditioned
conjugate gradient solver for a sparse system (stored in
compressed diagonal format) that has a working set (for
example) 3x larger than one cpu's cache.  

The algorithm cannot be blocked to any significant degree,
because the dot products create a global data dependence at
each iteration.

On 4 cpus, the whole working set fits in the aggregate cache,
and the speed of the dot products increases by 2x-5x *per cpu*
relative to the out-of-cache version for typical cached
systems.  At this point, the overhead of communications is
negligible, and it is quite unlikely to overcome this effect.

If that problem size seems artificially small, note that
SGI is currently shipping machines with 512 MB of cache
(128 cpu Origin2000).  We see superlinear speedups on
some CFD codes all the way out to the range of 100 cpus.
-- 
John D. McCalpin, Ph.D.      Principal Scientist
System Architecture Group    http://reality.sgi.com/mccalpin/
Silicon Graphics, Inc.       mccalpin@sgi.com  650-933-7407

--
Articles to bigrigg+parallel@cs.cmu.edu (Admin: bigrigg@cs.cmu.edu)
Archive: http://www.hensa.ac.uk/parallel/internet/usenet/comp.parallel

