From: Brian.Oneill@ntu.ac.uk
Newsgroups: comp.sys.arm,comp.sys.transputer
Subject: Re: Floating Point Performance of the StrongARM
Date: Fri, 19 Feb 1999 12:47:12 GMT
Organization: Deja News - The Leader in Internet Discussion
Message-Id: <7ajmge$f7v$1@nnrp1.dejanews.com>
References: <797adq$br7$1@nnrp1.dejanews.com>
Xref: ukc comp.sys.arm:3082 comp.sys.transputer:9055


Further to my posting of 2nd Feb 1999 on integer and floating point tests on
the StrongARM, Wilco Dijkstra wrote:

>>                           Integer                  Float
>> StrongARM - 233 MHz
>>    2.11 compiler           0.366 sec             34.50 sec
>>    2.50 compiler           0.366 sec              1.29 sec
>>    J Brown’s FP                                  10.30 sec
>
>Actually, when I use the ARM tools, and compile & emulate for the
>StrongARM, I get (options -apcs /nofp/noswst/softfp -Otime
>-cpu StrongARM1):
>
>2.11 tools:                   0.060 sec              3.17 sec
>2.50 tools:                   0.060 sec              0.93 sec
>
>These are more realistic results: the integer loop takes around 14
>cycles per iteration, the floating about 220 (5 * 28 + 2 * 37 + 5).
>
>The difference is easily explained by the fact that there are 1 million
>stores directly to slow main memory (the array isn't prefetched),
>adding about 0.30 sec to the execution time (must be quite slow memory -
>90 cycles per store!??).
>
>Try adding:
>
>ans[0] = 0;
>for (i = 1; i < 10; i++) ans[i] = ans[i-1];
>
>after the first call to Time. This both initialises ans properly and
>prefetches it too (unless your compiler is a little too good in
>loop dependency checking). You should get results within 10% of mine.

We have now rerun this code and obtained results close to this estimate.

Revised results:
                           Integer                  Float
StrongARM - 221.1 MHz
    2.50 compiler           0.088 sec              1.02 sec
    J Brown’s FP                                   1.33 sec
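
As a rough cross-check of the revised figures against the cycle estimate
quoted above, the times can be converted back into cycles per inner-loop
iteration.  The sketch below assumes the 1 million timed iterations of the
posted loop (100000 x 10) and the 221.1 MHz clock; the helper names are
hypothetical and are not part of our test code.

```c
#include <assert.h>
#include <stdio.h>

/* The posted loop runs 100000 outer passes of 10 inner passes each. */
#define TIMED_ITERATIONS 1000000UL

/* Hypothetical helper (not part of the posted test code): convert a
   measured time into an average cycle count per inner-loop iteration. */
double cycles_per_iteration(double seconds, double clock_mhz,
                            unsigned long iterations)
{
    assert(iterations > 0);
    return seconds * clock_mhz * 1.0e6 / (double)iterations;
}

/* Print one line of the cross-check. */
void report(const char *label, double seconds, double clock_mhz)
{
    printf("%-22s %7.1f cycles/iteration\n", label,
           cycles_per_iteration(seconds, clock_mhz, TIMED_ITERATIONS));
}
```

Called as report("2.50 compiler, float", 1.02, 221.1) this gives roughly 225
cycles per iteration, against the estimate of about 220; the integer time of
0.088 sec works out to roughly 19 cycles per iteration.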


The difference arises because in our original test the data cache (DC) and
write buffer (WB) were enabled but the MMU was not.  The MMU must be enabled
before the DC or the WB can be used.  With the floating point libraries of the
ARM 2.50 compiler, enabling either the DC or the WB alone is enough to obtain
the best performance.  With Julian Brown’s FP libraries both the DC and the WB
must be enabled, and enabling the DC produces most of the improvement in
performance.
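
To illustrate the dependency on the MMU, here is a minimal sketch of how the
control-register value could be built.  It assumes the usual ARM CP15
register 1 bit layout (M = bit 0, C = bit 2, W = bit 3); the macro and
function names are illustrative and do not come from our test code.  The
actual register write and the translation table setup are target-specific
and are only indicated in a comment.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed CP15 register 1 (control) bit positions. */
#define CTRL_MMU      (1u << 0)   /* M: MMU enable          */
#define CTRL_DCACHE   (1u << 2)   /* C: data cache enable   */
#define CTRL_WBUFFER  (1u << 3)   /* W: write buffer enable */

/* Hypothetical helper: return a control value with the requested DC/WB
   bits added, forcing the MMU bit on whenever either is requested,
   since the DC and WB are not usable without the MMU. */
uint32_t control_with_caches(uint32_t current, int want_dcache,
                             int want_wbuffer)
{
    uint32_t value = current;

    if (want_dcache)
        value |= CTRL_DCACHE;
    if (want_wbuffer)
        value |= CTRL_WBUFFER;
    if (value & (CTRL_DCACHE | CTRL_WBUFFER))
        value |= CTRL_MMU;

    /* Postcondition: DC or WB is never set without the MMU. */
    assert((value & CTRL_MMU) ||
           !(value & (CTRL_DCACHE | CTRL_WBUFFER)));

    /* On the target the value would then be written in a privileged mode
       with something like "mcr p15, 0, Rd, c1, c0, 0", after a
       translation table has been set up (not shown here). */
    return value;
}
```

Starting from a control value of 0, requesting both the DC and the WB yields
0xD (M, C and W set); requesting neither leaves the value unchanged.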

The modification suggested above produces no improvement in the performance of
the code.  This would indicate that the results for the calculation are stored
in cache on both read and write operations.


Brian O’Neill

=====================================================================
Brian C. O'Neill              | Tel: +44 0115 948 6044
Dept of Elec & Electronic Eng | Fax: +44 0115 948 6567
Nottingham Trent University   | E-mail: Brian.ONeill@ntu.ac.uk
Nottingham                    | http://eee.ntu.ac.uk/research/parallel
NG1 4BU                       |
UK
====================================================================

Below is a copy of our test code.

#include "SoftInt.h"

float jsfp_add(float a, float b);
float jsfp_sub(float a, float b);
float jsfp_mul(float a, float b);
float jsfp_div(float a, float b);

void BenchMark2(void)
{
    /*
    //section used for integer op
    unsigned long int i;
    unsigned long int j;
    int p, q;
    unsigned int k, l, m;
    unsigned int ans[10];
    */

    //section used for floating point op
    long unsigned int i, j;
    float p, q, k, l, m;
    float ans[10];

    Time();     //first timing call

    j=0;

    //100000 outer passes x 10 inner passes = 1 million timed iterations
    for (i=0;i<100000;i++)
    {
        p=4.0F;
        q=200.0F;

        //benchmarking starts here
        for (j=0;j<10;j++)
        {
            //version using the compiler's floating point library
            p++;
            q++;
            k = p + q;
            l = k*p;
            m = l*q;
            ans[j] = k + l + m;

            //same calculation using the jsfp_ routines
/*          p = jsfp_add(p,1.0F);
            q = jsfp_add(q,1.0F);
            k = jsfp_add(p,q);
            l = jsfp_mul(k,p);
            m = jsfp_mul(l,q);
            ans[j] = jsfp_add(k,l);
            ans[j] = jsfp_add(ans[j],m);  */
        }
    }

    Time();     //second timing call

    //outside the timed region: fold the results together
    i = 0;
    for (i=0;i<10;i++)
    {
        ans[0]=ans[i] + ans[0];
    }

    writeHex(ans[1]);
    Exit();
}





