Newsgroups: comp.sys.transputer
From: herman7@iaehv.nl (herman7)
Subject: Re: FFT algorithm
Organization: Internet Access Eindhoven, the Netherlands
Date: 31 Oct 1995 21:51:41 GMT
Mime-Version: 1.0
Content-Type: Text/Plain; charset=ISO-8859-1
Message-ID: <4765pd$nu9@iaehv.IAEhv.nl>

In article <475cm2$40l@geriatrix.bangor.ac.uk>, cardoso@sees.bangor.ac.uk 
says...
>
>Hi

Hi,


>I have managed to run a FFT program written by S. Gudvangen, U. of
>Newcastle in a T805 transputer and using the D7205 software package.
>The program takes aproximately 18.75 ms to calculate the FFT of a
>512 points data sequence and 41.98 ms for 1024 points. This was the
>only process running. This seems to me an incredible high value.
It is not bad.

>What can I do to speed up the program ?

There are a number of techniques you can apply to speed things up, such as 
loop unrolling and localizing access so that you can make good use of 
OCCAM abbreviations.

Using these techniques I came up with the following code some years ago.
It was developed using the MEIKO version of the TDS, OPS, and is used as 
one of the example codes for the Advanced OCCAM and Transputer Engineering 
Workshop that is regularly held at the University of Kent at Canterbury.

Execution time for a T800 @ 25 MHZ was around 30 ms for a single precision 
complex 1024 point FFT. Try it and tell me the results.
It is part of a distributed implementation, hence all the configurability 
stuff. I wrote a paper on it that was published in the NATUG-4 conference.
T
>
>Does anyone can point me an ftp site where I can download other
>FFT algorithms?
>
I will present you the code here. Results come out unordered. If you need 
a test rig to verify any new versions, I can mail you one too.
I hope it is of use to you.

------------------------ fft64.occ -------------
--{{{  LIB fft64
--{{{  VALues
VAL nr.of.columns IS  10: -- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
VAL nr.of.points  IS  1 << nr.of.columns:
VAL half.points   IS  nr.of.points >> 1:
VAL LOG.processors.per.column IS 0: -- 0 1 2 3 4
VAL processors.per.column IS 1 << LOG.processors.per.column:
VAL processors.per.row IS 1: -- 1 2 3 4 5 6 7 8 9 10
VAL nr.of.workers IS processors.per.column * processors.per.row:
VAL local.points IS half.points / processors.per.column:
VAL local.columns IS nr.of.columns / processors.per.row:

PROTOCOL p1.complex IS [half.points][2]REAL64:
PROTOCOL p2.complex IS [local.points][2]REAL64:
--}}} 
--}}} 
------------------------------------------------

-------------------------- fastsmal.occ ---------
--{{{  #USE
#INCLUDE "fft64.occ"
#INCLUDE "mathhdr.occ"
#USE "maths64.lib"
--}}} 
PROC fast.seq.small.bfly(VAL INT row, col,
              CHAN OF p2.complex a.chan, b.chan, x.chan, y.chan)
  --{{{  
  --{{{  variables and VAL declarations
  [local.columns][local.points][2]REAL64 weight:
  [2][local.points << 1][2]REAL64 c1, c2:
  VAL even IS (local.columns + 1) /\ 1:
  --}}} 
  --{{{  PROC bfly
  PROC slice.bfly([][2]REAL64 input.a, input.b,
        VAL [][2]REAL64 w,
            [][2]REAL64 output)
    --{{{  explicit butterflies
    PROC bfly(VAL [2]REAL64 a, b, w, [2]REAL64 x, y)
      --{{{  butterflies
      --{{{  VAL
      VAL re IS 0:
      VAL im IS 1:
      --}}} 
      SEQ
        x[re] := (a[re] + b[re])
        x[im] := (a[im] + b[im])
        VAL REAL64 dr IS a[re] - b[re]:
        VAL REAL64 di IS a[im] - b[im]:
        SEQ
          y[re] := (w[re] * dr) - (w[im] * di)
          y[im] := (w[re] * di) + (w[im] * dr)
      --}}} 
    :
    SEQ
      --{{{  
      bfly(input.a[0], input.b[0], w[0], output[0], output[1])
      bfly(input.a[1], input.b[1], w[1], output[2], output[3])
      bfly(input.a[2], input.b[2], w[2], output[4], output[5])
      bfly(input.a[3], input.b[3], w[3], output[6], output[7])
      bfly(input.a[4], input.b[4], w[4], output[8], output[9])
      bfly(input.a[5], input.b[5], w[5], output[10], output[11])
      bfly(input.a[6], input.b[6], w[6], output[12], output[13])
      bfly(input.a[7], input.b[7], w[7], output[14], output[15])
      --}}} 
    --}}} 
  :
  --}}} 
  --{{{  PROC fft(VAL [2][][2]REAL64 a, b, [2][][2]REAL64 x, y, VAL INT 
start.col)
  PROC fft([2][local.points << 1][2]REAL64 c, VAL INT start.col)
    --{{{  compute
    SEQ i = 0 FOR local.columns
      VAL i.in IS (start.col + i) /\ 1:
      VAL shift IS 3:
      --{{{  mul etc
      VAL mul IS 1 << shift:
      VAL mul.plus.1 IS mul + 1:
      --VAL mul.times.two IS mul + mul:
      --}}} 
      input.row IS  c[i.in]:
      output.row IS c[i.in >< 1]:
      weight.row IS weight[i]:
      SEQ j = 0 FOR local.points >> shift
        --{{{  abbrev
        VAL mul.times.j IS mul TIMES j:
        VAL [][2]REAL64 w IS [weight.row FROM mul.times.j FOR mul]:
        [][2]REAL64 output IS [output.row FROM mul.times.j + mul.times.j FOR
                                               mul + mul]:
        [][2]REAL64 input.a IS [input.row FROM mul.times.j FOR mul]:
        [][2]REAL64 input.b IS [input.row FROM mul.times.j + local.points FOR 
mul]:
        --}}} 
        slice.bfly(input.a, input.b, w, output)
    --}}} 
  :
  --}}} 
  SEQ
    --{{{  init weights
    --{{{  init with W first
    VAL col.offset IS col * local.columns:
    VAL row.offset IS row * local.points:
    SEQ l.col = 0 FOR local.columns
      VAL mask IS (-1) << (col.offset + l.col):
      SEQ j = 0 FOR local.points
        SEQ
          weight[l.col][j][0] := REAL64 ROUND ((row.offset + j) /\ mask)
          weight[l.col][j][1] := REAL64 ROUND ((row.offset + j) /\ mask)
    --}}} 
    --{{{  do actual initialization
    VAL REAL64 W IS (2.0(REAL64) * DPi) / (REAL64 ROUND nr.of.points):
    SEQ i = 0 FOR local.columns
      SEQ j = 0 FOR local.points
        [2]REAL64 w IS weight[i][j]:
        SEQ
          w[0] := DCOS(w[0] * W)
          w[1] := (-DSIN(w[1] * W))
    --}}} 
    --}}} 
    PAR
      --{{{  read input buffer 1
      a.chan ? [c1[0] FROM 0 FOR local.points]
      b.chan ? [c1[0] FROM local.points FOR local.points]
      --}}} 
    PAR
      --{{{  read input buffer 2
      PAR
        a.chan ? [c2[0] FROM 0 FOR local.points]
        b.chan ? [c2[0] FROM local.points FOR local.points]
      --}}} 
      fft(c1, 0)
    WHILE TRUE
      SEQ
        PAR
          --{{{  read input buffer 1
          a.chan ? [c1[1 >< even] FROM 0 FOR local.points]
          b.chan ? [c1[1 >< even] FROM local.points FOR local.points]
          --}}} 
          --{{{  write output buffer 1
          x.chan ! [c1[0 >< even] FROM 0 FOR local.points]
          y.chan ! [c1[0 >< even] FROM local.points FOR local.points]
          --}}} 
          fft(c2, 0)
        PAR
          --{{{  read input buffer 2
          a.chan ? [c2[0 >< even] FROM 0 FOR local.points]
          b.chan ? [c2[0 >< even] FROM local.points FOR local.points]
          --}}} 
          --{{{  write output buffer 2
          x.chan ! [c2[1 >< even] FROM 0 FOR local.points]
          y.chan ! [c2[1 >< even] FROM local.points FOR local.points]
          --}}} 
          fft(c1, 0 >< even)
        PAR
          --{{{  read input buffer 1
          a.chan ? [c1[0] FROM 0 FOR local.points]
          b.chan ? [c1[0] FROM local.points FOR local.points]
          --}}} 
          --{{{  write output buffer 1
          x.chan ! [c1[1] FROM 0 FOR local.points]
          y.chan ! [c1[1] FROM local.points FOR local.points]
          --}}} 
          fft(c2, 0 >< even)
        PAR
          --{{{  read input buffer 2
          a.chan ? [c2[0] FROM 0 FOR local.points]
          b.chan ? [c2[0] FROM local.points FOR local.points]
          --}}} 
          --{{{  write output buffer 2
          x.chan ! [c2[1] FROM 0 FOR local.points]
          y.chan ! [c2[1] FROM local.points FOR local.points]
          --}}} 
          fft(c1, 0)
  --}}} 
:
-----------------------------------------------------

Call with row, col both set to 0, as the configuration file specifies just 
one worker processor.

>Thank you
>
No thanks,

>Jose Carlos
>
Herman Roebbers