These files have been made freely available by SGS-Thomson and may not be used to generate commercial products without explicit permission and agreed licensing terms OR placed in a public archive or given to third parties without explicit written permission from SGS-Thomson in Bristol.

Tony Debling, SGS-Thomson Microelectronics Ltd, 1000 Aztec West, Almondsbury, Bristol BS12 4SQ, England. June 7, 1995.

Converted to HTML via RTFtoHTML, PERL5 and by hand by Dave Beckett <D.J.Beckett@ukc.ac.uk>.

WARNING: This has been converted from RTF format and the formulas and expressions have not been checked with the printed version. If you find any errors in this file, please send them to me, and I will update it.

Dave Beckett <D.J.Beckett@ukc.ac.uk>

Elementary maths and trigonometric functions are provided in three libraries, as follows:

Library Description snglmath.lib Single length library dblmath.lib Double length library tbmaths.lib TB optimized library

The single and double length libraries contain the same set of maths functions
in single and double length forms. The double length forms all begin with the
letter `` D`'. All function names are in upper case.

The TB optimized library is a combined single and double length library
containing functions for the T4 series (T400, T414, T425, and T426). The
functions have been optimized for speed. The standard single or double length
libraries *can* be used on T4 processors but optimum performance will be
achieved by using the TB optimized library. The accuracy of the
T400/T414/T425/T426 optimized functions is similar to that of the standard
single length functions but results returned may not be identical because
different algorithms are used. If the optimized library is used in code
compiled for any processor except a T400, T414, T425, or T426, the compiler
reports an error.

To obtain the best possible speed performance with the occam maths functions use the following strategy:

- For networks consisting of only T4 series transputers, use the
library.**tbmaths.lib** - For networks consisting of only T8 series transputers, use the
and**snglmath.lib**libraries.**dblmath.lib** - For networks consisting of a mix of T4 series and T8 series transputers use:
on the T4 series and**tbmaths.lib**or**snglmath.lib**on the T8 series when a consistent level of accuracy is not required;**dblmath.lib**- if accuracy must be the same in the T8 and T4 processes then use the
and**snglmath.lib**libraries.**dblmath.lib**

The elementary function library is also described in appendix N of the *occam
2 Reference manual*.

These function subroutines have been written to be compatible with the ANSI standard for binary floating-point arithmetic (ANSI IEEE std 754-1985), as implemented in occam. They are based on the algorithms in:

Cody, W. J., and Waite, W. M. [1980].Software Manual for the Elementary Functions.Prentice-Hall, New Jersey.

The only exceptions are the pseudo-random number generators, which are based on algorithms in:

Knuth, D. E. [1981].The Art of Computer Programming, 2nd. edition, Volume 2: Seminumerical Algorithms.Addison-Wesley, Reading, Mass.

Exceptions

Error conditions are reported by means of three distinct **NaN**s:

The implementations will return the following values for these Not-a-Numbers:

Error Single length value Double length value undefined.NaN #7F800010 #7FF00002 00000000 unstable.NaN #7F800008 #7FF00001 00000000 inexact.NaN #7F800004 #7FF00000 80000000

Range Reduction

For most of the functions the quoted error is for arguments in the primary domain, which represents the basic accuracy of the approximation. For some functions the process of range reduction results in a higher accuracy for arguments outside the primary domain, and for others it does the reverse. Refer to the notes on each function for more details.

If *f* is the mathematical function and *F* the subroutine
approximation, then the relative error at the floating-point number
*X* (provided *f*(*X*) is not zero) is:

Obviously the relative error may become very large near a zero of
*f*(*X*). If the zero is at an irrational argument (which
cannot be represented as a floating-point value), the absolute error is a
better measure of the accuracy of the function near the zero.

As it is impractical to find the relative error for every possible argument,
statistical measures of the overall error must be used. If the relative error
is sampled at a number of points *Xn* (*n* = 1 to *N*), then
useful statistics are the *maximum relative error* and the
*root-mean-square relative error*:

Corresponding statistics can be formed for the absolute error also, and are
called *MAE* and *RMSAE* respectively.

The *MRE* generally occurs near a zero of the function, especially if the
true zero is irrational, or near a singularity where the result is large, since
the `granularity' of the floating-point numbers then becomes
significant.

A useful unit of relative error is the relative magnitude of the least significant bit in the floating-point fraction, which is called one `unit in the last place' (ulp), (i.e. the smallest e such that 1+e 01). Its magnitude depends on the floating-point format: for single-length it is 2^(-23) = 1.19*10^(-7), and for double-length it is 2^(-52 )= 2.22*10^(-16).

If the relative error in the argument *X* is *d* (typically a few
ulp), then the absolute error (*E*) and relative error (*e*) in
*f*(*X*) are:

This defines the absolute and relative error magnification factors *A* and
*R*. When both are large the function is unstable, i.e. even a small error
in the argument, such as would be produced by evaluating a floating-point
expression, will cause a large error in the value of the function. The
functions return an **unstable.NaN** in such cases which are simple to
detect.

The functional forms of both *A* and *R* are given in the
specification of each function.

In both cases the higher-precision implementation was used to approximate
the mathematical function (called *f* above) in the computation of the
error, which was evaluated in the higher precision to avoid rounding errors.
Error statistics were produced according to the formulae above.

All single length functions except ` RAN` return a single
result of type

` POWER` and

` RAN` returns two results, of types

In each case the double-length version of *name* is called
**D***name*, returns a ` REAL64` (except

**A and R**Multiplying factors relating the absolute and relative errors in the output to the relative error in the argument.**Exceptions**Outputs for invalid inputs (i.e. those outside the*domain*), other than**NaN**(**NaN**s are copied directly to the output and are not listed as exceptions). These are all**Inf**s or**NaN**s.**Generated Error**The difference between the true and computed values of the function, when the argument is error-free. This is measured statistically and displayed for one or two ranges of arguments, the first of which is usually the*primary domain*(see below). The second range, if present, is chosen to illustrate the typical behavior of the function.**Domain**The range of valid inputs, i.e. those for which the output is a normal or denormal floating-point number.**MAE and RMSAE**The Maximum Absolute Error and Root-Mean-Square absolute error taken over a number of arguments drawn at random from the indicated range.**MRE and RMSRE**The Maximum Relative Error and Root-Mean-Square relative error taken over a number of arguments drawn at random from the indicated range.**Range**The range of outputs produced by all arguments in the*Domain*. The given endpoints are not exceeded.**Primary Domain**The range of arguments for which the result is computed using only a single rational approximation to the function. There is no argument reduction in this range.**Propagated Error**The absolute and relative error in the function value, given a small relative error in the argument.**ulp**The unit of relative error is the `unit in the last place' (ulp). This is the relative magnitude of the least significant bit of the floating-point fraction (i.e. the smallest e such that 1+e 0 1).

For the standard single-length format it is 2^(-23) = 1.19*10^(-7).

For the double-length format it is 2^(-52) = 2.22*10^(-16).

This is also used as a measure of absolute error, since such errors can be considered `relative' to unity.

Where the range depends on the floating-point format, single-length is indicated with an S and double-length with a D.

For functions with two arguments the complete range of both arguments is given. This means that for each number in one range, there is at least one (though sometimes only one) number in the other range such that the pair of arguments is valid. Both ranges are shown, linked by an `x'.

Pi means the closest floating-point representation of the transcendental
number p, ln(2) the closest representation of log *e*(2), and so
on.

In describing the algorithms, `*X*' is used generically to designate the
argument, and `result' (or RESULT, in the style of occam functions) to
designate the output.

These two libraries will be efficient on processors with fast
floating-point arithmetic and good support for the floating-point
predefined functions such as ` MULBY2` and

A special version has been produced for 16-bit transputers, which avoids the use of any double-precision arithmetic in the single precision functions. This is distinguished in the notes by the annotation `T2 special'; notes relating to the version for T8 and TB are denoted by `standard'.

Single and double length maths functions are listed below. Descriptions of the functions can be found in succeeding sections.

To use the single length library a program header must include the line

#USE "snglmath.lib"

To use the double length library a program header must include the line

#USE "dblmath.lib"

`
`

Result(s) Function Parameter specifiers REAL32 ALOG VAL REAL32 X REAL32 ALOG10 VAL REAL32 X REAL32 EXP VAL REAL32 X REAL32 POWER VAL REAL32 X, VAL REAL32 Y REAL32 SIN VAL REAL32 X REAL32 COS VAL REAL32 X REAL32 TAN VAL REAL32 X REAL32 ASIN VAL REAL32 X REAL32 ACOS VAL REAL32 X REAL32 ATAN VAL REAL32 X REAL32 ATAN2 VAL REAL32 X, VAL REAL32 Y REAL32 SINH VAL REAL32 X REAL32 COSH VAL REAL32 X REAL32 TANH VAL REAL32 X REAL32,INT32 RAN VAL INT32 X REAL64 DALOG VAL REAL64 X REAL64 DALOG10 VAL REAL64 X REAL64 DEXP VAL REAL64 X REAL64 DPOWER VAL REAL64 X, VAL REAL64 Y REAL64 DSIN VAL REAL64 X REAL64 DCOS VAL REAL64 X REAL64 DTAN VAL REAL64 X REAL64 DASIN VAL REAL64 X REAL64 DACOS VAL REAL64 X REAL64 DATAN VAL REAL64 X REAL64 DATAN2 VAL REAL64 X, VAL REAL64 Y REAL64 DSINH VAL REAL64 X REAL64 DCOSH VAL REAL64 X REAL64 DTANH VAL REAL64 X REAL64,INT64 DRAN VAL INT64 X

DALOG

REAL32 FUNCTION ALOG (VAL REAL32 X)

REAL64 FUNCTION DALOG (VAL REAL64 X)

Compute loge(X).

Domain: (0, XMAX] Range: [MinLog, MaxLog] [MinLog, MaxLog] (See note 2) Primary Domain: [p2/2, p2) = [0.7071, 1.4142)

**Exceptions**

All arguments outside the domain generate anundefined.NaN.

**Propagated Error**

A5 1,R+1/loge(X)

**Generated Error**

Primary Domain Error: MRE RMSRE Single Length(Standard): 1.7 ulp 0.43 ulp Single Length(T2 special): 1.6 ulp 0.42 ulp Double Length: 1.4 ulp 0.38 ulp

**The Algorithm**

- Split
*X*into its exponent*N*and fraction*F*. - Find
*LnF*, the natural log of*F*, with a floating-point rational approximation. - Compute ln(2) *
*N*with extended precision and add it to*LnF*to get the result.

**Notes**

1) The term ln(2) *Nis much easier to compute (and more accurate) thanLnF, and it is larger providedNis not 0 (i.e. for arguments outside the primary domain). Thus the accuracy of the result improves as the modulus of log(X) increases.2) The minimum value that can be produced,

MinLog, is the logarithm of the smallest denormalized floating-point number. For single lengthMinLogis -103.28, and for double length it is -744.4. The maximum valueMaxLogis the logarithm ofXMAX. For single-length it is 88.72, and for double-length it is 709.78.3) Since

Infis used to representallvalues greater thanXMAXits logarithm cannot be defined.4) This function is well-behaved and does not seriously magnify errors in the argument.

**ALOG10DALOG10**

REAL32 FUNCTION ALOG10 (VAL REAL32 X)

REAL64 FUNCTION DALOG10 (VAL REAL64 X)

Compute log10(X).

Domain: (0, XMAX] Range: [MinL10, MaxL10] (See note 2) Primary Domain: [p2/2, p2) = [0.7071, 1.4142)

**Exceptions**

All arguments outside the domain generate anundefined.NaN.

**Propagated Error**

A== log10(e),R= log10(e)/loge(X)

**Generated Error**

Primary Domain Error: MRE RMSRE Single Length (Standard): 1.70 ulp 0.45 ulp Single Length (T2 special): 1.71 ulp 0.46 ulp Double Length: 1.84 ulp 0.45 ulp

**The Algorithm**

- Set
*temp*:=.**ALOG(X)** - If
*temp*is a**NaN**, copy it to the output, otherwise set

result = log(*e*) **temp*

1) See note 1 for.ALOG2) The minimum value that can be produced,

MinL10, is the base 10 logarithm of the smallest denormalized floating-point number. For single lengthMinL10 is -44.85, and for double length it is -323.3. The maximum valueMaxL10 is the base 10 logarithm ofXMAX. For single lengthMaxL10 is 38.53, and for double-length it is 308.26.3) Since

Infis used to representallvalues greater thanXMAXits logarithm cannot be defined.4) This function is well-behaved and does not seriously magnify errors in the argument.

**EXPDEXP**

REAL32 FUNCTION EXP (VAL REAL32 X)

REAL64 FUNCTION DEXP (VAL REAL64 X)

Computee^(X).

Domain: [-Inf, MaxLog) = [-Inf, 88.72)S, [-Inf, 709.78)D Range: [0, Inf) (See note 4) Primary Domain: [-Ln2/2, Ln2/2) = [-0.3466, 0.3466)

** Exceptions**

All arguments outside the domain generate anInf.

**Propagated error**

A=Xe^(X),R=X

**Generated error**

Primary Domain Error: MRE RMSRE Single Length(Standard): 0.99 ulp 0.25 ulp Single Length(T2 special): 1.0 ulp 0.25 ulp Double Length: 1.4 ulp 0.25 ulp

The Algorithm

- Set
*N*= integer part of*X*/ln(2). - Compute the remainder of
*X*by ln(2), using extended precision arithmetic. - Compute the exponential of the remainder with a floating-point rational approximation.
- Increase the exponent of the result by
*N*. If*N*is sufficiently negative the result must be denormalized.

1)MaxLogis loge(XMAX).2) For sufficiently negative arguments (below -87.34 for single-length and below -708.4 for double-length) the output is denormalized, and so the floating-point number contains progressively fewer significant digits, which degrades the accuracy. In such cases the error can theoretically be a factor of two.

3) Although the true exponential function is never zero, for large negative arguments the true result becomes too small to be represented as a floating-point number, and

underflows to zero. This occurs for arguments below -103.9 for single-length, and below -745.2 for double-length.EXP4) The propagated error is considerably magnified for large positive arguments, but diminished for large negative arguments.

**POWERDPOWER**

REAL32 FUNCTION POWER (VAL REAL32 X, Y)

REAL64 FUNCTION DPOWER (VAL REAL64 X, Y)

ComputeX^(Y).

Domain: [0, Inf] x [-Inf, Inf] Range: (-Inf, Inf) Primary Domain: See note 3.

** Exceptions**

If the first argument is outside its domain,undefined.NaNis returned. If the true value ofX^(Y)exceedsXMAX,Infis returned. In certain other cases otherNaNs are produced: See note 2.

**Propagated Error**

A=YX^(Y)(1 [[formula]] loge(X)),R=Y(1 [[formula]] loge(X)) (See note 4)

**Generated error**

Example Range Error: MRE RMSRE (See note 3) Single Length(Standard): 1.0 ulp 0.25 ulp Single Length(T2 special): 63.1 ulp 13.9 ulp Double Length: 21.1 ulp 2.4 ulp

**The Algorithm**

Deal with special cases: either argument = 1, 0, +Infor -Inf(see note 2). Otherwise:(a) For the standard single precision:

- Compute
*L*= log*e*(*X*) in double precision, where*X*is the first argument. - Compute
*W*=*Y*x*L*in double precision, where*Y*is the second argument. - Compute
*RESULT*=*e^(W)*in single precision.

(b) For double precision, and the single precision special version:

- Compute
*L*= log2(*X*) in extended precision, where*X*is the first argument. - Compute
*W*=*Y*x*L*in extended precision, where*Y*is the second argument. - Compute
*RESULT*= 2^(*W*) in extended precision.

1) This subroutine implements the mathematical functionx^(y)to a much greater accuracy than can be attained using theandALOGfunctions, by performing each step in higher precision. The single-precision version is more efficient than usingEXPandDALOGbecause redundant tests are omitted.EXP2) Results for special cases are as follows:

First Input (X) Second Input (Y) Result t0 ANY undefined.NaN 0 v0 undefined.NaN 0 0 <Y<=XMAX 0 0 Inf unstable.NaN 0 t X t1 Inf 0 0 t X t1 -Inf Inf 1 -XMAX<=Y<=XMAX 1 1 " Inf unstable.NaN 1 t X v XMAX Inf Inf 1 t X v XMAX -Inf 0 Inf 1 v Y v Inf Inf Inf -Inf vY v -1 0 Inf -1 t Y t1 undefined.NaN otherwise 0 1 otherwise 1 X

3) Performing all the calculations in extended precision makes the double-precision algorithm very complex in detail, and having two arguments makes a primary domain difficult to specify. As an indication of accuracy, the functions were evaluated at 100,000 points logarithmically distributed over (0.1, 10.0), with the exponent linearly distributed over (-35.0, 35.0) (single-length), and (-300.0, 300.0) (double-length), producing the errors given above. The errors are much smaller if the exponent range is reduced.4) The error amplification factors are calculated on the assumption that the relative error in

Yis [[formula]] that inX, otherwise there would be separate factors for bothXandY. It can be seen that the propagated error will be greatly amplified whenever loge(X) orYis large.

**SINDSIN**

REAL32 FUNCTION SIN (VAL REAL32 X)

REAL64 FUNCTION DSIN (VAL REAL64 X)

Compute sine(X) (whereXis in radians).

Domain: [-Smax, Smax] = [-205887.4, 205887.4]S (Standard),

= [-4.2*106, 4.2*106]S (T2 special)

= [-4.29*109, 4.29*109]D Range: [-1.0, 1.0] Primary Domain: [-Pi/2, Pi/2]= [-1.57, 1.57]

**Exceptions**

All arguments outside the domain generate aninexact.NaN, except [[formula]]Inf, which generates anundefined.NaN.

**Propagated Error**

A=Xcos(X),R=Xcot(X)

**Generated error ** (See note 1)

Primary Domain [0, 2Pi] MRE RMSRE MAE RMSAE Single Length(Standard): 0.94 ulp 0.23 ulp 0.96 ulp 0.19 ulp Single Length(T2 special): 0.92 ulp 0.23 ulp 0.94 ulp 0.19 ulp Double Length: 0.90 ulp 0.22 ulp 0.91 ulp 0.18 ulp

**The Algorithm**

- Set
*N*= integer part of*X**Pi*. - Compute the remainder of
*X*by*Pi*, using extended precision arithmetic (double precision in the standard version). - Compute the sine of the remainder using a floating-point polynomial.
- Adjust the sign of the result according to the sign of the argument and
the evenness of
*N*.

Notes1) For arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extra precision of step 2 is lost if

Nbecomes too large, and the cut-offSmaxis chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger thanSmaxa change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.2) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function (outside the primary range), but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.

3) Since only the remainder of

XbyPiis used in step 3, the symmetrysin(x+np ) = [[formula]]sin(x) is preserved, although there is a complication due to differing precision representations of p.4) The output range is not exceeded. Thus the output of

is always a valid argument forSIN.ASIN

**COSDCOS**

REAL32 FUNCTION COS (VAL REAL32 X)

REAL64 FUNCTION DCOS (VAL REAL64 X)

Compute cosine(X) (whereXis in radians).

Domain: [-Cmax, Cmax] = [-205887.4, 205887.4]S (Standard),

= [-12868.0, 12868.0]S (T2 special)

= [-2.1*108, 2.1*108]D Range: [-1.0, 1.0] Primary Domain: See note 1.

**Exceptions**

All arguments outside the domain generate aninexact.NaN, except [[formula]]Inf, which generates anundefined.NaN.

**Propagated Error**

A= -Xsin(X),R= -Xtan(X)(See note 4)

**Generated error**

[0, Pi/4) [0, 2Pi] MRE RMSRE MAE RMSAE

Single Length(Standard): 0.93 ulp 0.25 ulp 0.88 ulp 0.18 ulp Single Length(T2 special): 1.1 ulp 0.3 ulp 0.94 ulp 0.19 ulp Double Length: 1.0 ulp 0.28 ulp 0.90 ulp 0.19 ulp

** The Algorithm**

- Set
*N*= integer part of (*X*)*Pi*2)*Pi*and compute the remainder of (*X*)*Pi*2) by*Pi*, using extended precision arithmetic (double precision in the standard version). - Compute the sine of the remainder using a floating-point polynomial.
- Adjust the sign of the result according to the evenness of
*N*.

Notes1) Inspection of the algorithm shows that argument reduction always occurs, thus there is no `primary domain' for

. So for all arguments the accuracy of the result depends crucially on step 2. The standard single-precision version performs the argument reduction in double-precision, so there is effectively no loss of accuracy at this step. For the T2 special version and the double-precision version there are effectivelyCOSKextra bits in the representation of p(K=8 for the former and 12 for the latter). If the argument agrees with an odd integer multiple of p/2 to more thankbits there is a loss of significant bits from the computed remainder equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.2) The difference between

evaluated at successive floating-point numbers is given approximately by the absolute error amplification factor, A. For arguments larger thanCOSCmaxthis difference may be more than half the significant bits of the result, and so the result is considered to be essentially indeterminate and aninexact.NaNis returned. The extra precision of step 2 in the double-precision and T2 special versions is lost ifNbecomes too large, and the cut-off atCmaxprevents this also.3) For small arguments the errors are not evenly distributed. As the argument becomes smaller there is an increasing bias towards negative errors (which is to be expected from the form of the Taylor series). For the single-length version and

Xin [-0.1, 0.1], 62% of the errors are negative, whilst forXin [-0.01, 0.01], 70% of them are.4) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function, but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.

5) Since only the remainder of (|

X|+Pi/2) byPiis used in step 3, the symmetry cos(x+np ) = [[formula]] cos(x) is preserved. Moreover, since the same rational approximation is used as in, the relation cos(SINx) = sin(x+p/2) is also preserved. However, in each case there is a complication due to the different precision representations of p.6) The output range is not exceeded. Thus the output of

is always a valid argument forCOS.ACOS

TAN

DTAN

REAL32 FUNCTION TAN (VAL REAL32 X)

REAL64 FUNCTION DTAN (VAL REAL64 X)

Compute tan(X) (whereXis in radians).

Domain: [-Tmax, Tmax] = [-102943.7, 102943.7]S(Standard),

= [-2.1*106, 2.1*106]S(T2 special),

= [-2.1*109, 2.1*109]D Range: (-Inf, Inf) Primary Domain: [-Pi/4, Pi/4]= [-0.785, 0.785]

**Exceptions**

All arguments outside the domain generate aninexact.NaN, except [[formula]]Inf, which generate anundefined.NaN. Odd integer multiples of p/2 may produceunstable.NaN.

**Propagated Error**

A=X(1+ tan^(2)(X)),R=X(1+ tan^(2) (X))/tan(X)(See note 3)

**Generated error**

Primary Domain Error: MRE RMSRE (See note 3) Single Length(Standard): 1.44 ulp 0.39 ulp Single Length(T2 special): 1.37 ulp 0.39 ulp Double Length: 1.27 ulp 0.35 ulp

**The Algorithm**

- Set
*N*= integer part of*X*/(*Pi*/2), and compute the remainder of*X*by*Pi*/2, using extended precision arithmetic. - Compute two floating-point rational functions of the remainder,
*XNum*and*XDen*. - If
*N*is odd, set*RESULT*= -*XDen*/*XNum*, otherwise set*RESULT*=*XNum*/*XDen*.

1)Ris large wheneverXis near to an integer multiple of p/2, and so tan is very sensitive to small errors near its zeros and singularities. Thus for arguments outside the primary domain the accuracy of the result depends crucially on step 2, so this is performed with very high precision, using double precisionPi/2 for the standard single-precision function and two double-precision floating-point numbers for the representation of p /2 for the double-precision function. The T2 special version uses two single-precision floating-point numbers. The extra precision is lost ifNbecomes too large, and the cut-offTmaxis chosen to prevent this.2) The difference between

evaluated at successive floating-point numbers is given approximately by the absolute error amplification factor,TANA. For arguments larger thanSmaxthis difference could be more than half the significant bits of the result, and so the result is considered to be essentially indeterminate and aninexact.NaNis returned.3) Tan is quite badly behaved with respect to errors in the argument. Near its zeros outside the primary domain the relative error is greatly magnified, though the absolute error is only proportional to the size of the argument. In effect, the error is seriously amplified in an interval about each irrational zero, whose width increases roughly in proportion to the size of the argument. Near its singularities both absolute and relative errors become large, so any large output from this function is liable to be seriously contaminated with error, and the larger the argument, the smaller the maximum output which can be trusted. If step 3 of the algorithm requires division by zero, an

unstable.NaNis produced instead.4) Since only the remainder of

XbyPi/2 is used in step 3, the symmetry tan(x+np ) = tan(x) is preserved, although there is a complication due to the differing precision representations of p. Moreover, by step 3 the symmetry tan(x) = 1/tan( p/2 -x) is also preserved.

**ASINDASIN**

REAL32 FUNCTION ASIN (VAL REAL32 X)

REAL64 FUNCTION DASIN (VAL REAL64 X)

Compute sine^(-1)(X) (in radians).

Domain: [-1.0, 1.0] Range: [-Pi/2, Pi/2] Primary Domain: [-0.5, 0.5]

**Exceptions**

All arguments outside the domain generate anundefined.NaN.

**Propagated Error**

A=X/p1 -X^(2),R=X/(sin^(-1)(X) p1 -X^(2))

**Generated Error**

Primary Domain [-1.0, 1.0] MRE RMSRE MAE RMSAE

Single Length: 0.58 ulp 0.21 ulp 1.35 ulp 0.33 ulp Double Length: 0.59 ulp 0.21 ulp 1.26 ulp 0.27 ulp

**The Algorithm**

- If |
*X*| > 0.5, set*Xwork*:=(1 -**SQRT(***X*) 2. Compute**)***Rwork*= arcsine(-2 **Xwork*) with a floating-point rational approximation, and set the result =*Rwork*+*Pi*/2. - Otherwise compute the result directly using the rational approximation.
- In either case set the sign of the result according to the sign of the argument.

1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error: however since both domain and range are bounded theabsoluteerror in the result cannot be large.2) By step 1, the identity sin^(-1)(

x) = p/2 - 2 sin^(-1)(p(1-x)/2)) is preserved.

**ACOSDACOS**

REAL32 FUNCTION ACOS (VAL REAL32 X)

REAL64 FUNCTION DACOS (VAL REAL64 X)

Compute cosine^(-1)(X) (in radians).

Domain: [-1.0, 1.0] Range: [0, Pi] Primary Domain: [-0.5, 0.5]

**Exceptions**

All arguments outside the domain generate anundefined.NaN.

**Propagated Error**

A=-X/p1 -X^(2),R=-X/(sin^(-1)(X) p1 -X^(2))

**Generated Error**

Primary Domain [-1.0, 1.0] MRE RMSRE MAE RMSAE Single Length: 1.06 ulp 0.38 ulp 2.37 ulp 0.61 ulp Double Length: 0.96 ulp 0.32 ulp 2.25 ulp 0.53 ulp

**The Algorithm**

- If |
*X*| > 0.5, set*Xwork*:=(1 -**SQRT(***X*) 2. Compute**)***Rwork*= arcsine(2 **Xwork*) with a floating-point rational approximation. If the argument was positive, this is the result, otherwise set the result =*Pi*-*Rwork*. - Otherwise compute
*Rwork*directly using the rational approximation. If the argument was positive, set result =*Pi*/2 -*Rwork*, otherwise result =*Pi*/2 +*Rwork*.

1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error, although this interval is larger near 1 than near -1, since the function goes to zero with an infinite derivative there. However since both the domain and range are bounded theabsoluteerror in the result cannot be large.2) Since the rational approximation is the same as that in

, the relation cos^(-1)(ASINx) = p/2 - sin^(-1)(x) is preserved.

**ATANDATAN**

REAL32 FUNCTION ATAN (VAL REAL32 X)

REAL64 FUNCTION DATAN (VAL REAL64 X)

Compute tan^(-1)(X) (in radians).

Domain: [-Inf, Inf] Range: [-Pi/2, Pi/2] Primary Domain: [-z, z], z = 2 - p3 = 0.2679

**Exceptions**

None.

**Propagated Error**

A=X/(1 +X^(2)),R=X/(tan^(-1)(X)(1 +X^(2)))

**Generated Error**

Primary Domain Error: MRE RMSRE Single Length: 0.56 ulp 0.21 ulp Double Length: 0.52 ulp 0.21 ulp

**The Algorithm**

- If
*|X|*> 1.0, set*Xwork*= 1/|*X*| , otherwise*Xwork*= |*X*|. - If
*Xwork*> 2-p3, set*F*= (*Xwork**p3 -1)/(*Xwork*+p3), otherwise*F*=*Xwork*. - Compute
*Rwork*= arctan(*F*) with a floating-point rational approximation. - If
*Xwork*was reduced in (2), set*R*=*Pi*/6 +*Rwork*, otherwise*R*=*Rwork*. - If
*X*was reduced in (1), set*RESULT*=*Pi*/2 -*R*, otherwise*RESULT*=*R*. - Set the sign of the
*RESULT*according to the sign of the argument.

1) For |X| >ATmax, |tan^(-1)(X)| is indistinguishable from p/2 in the floating-point format. For single-length,ATmax= 1.68*10^(7), and for double-lengthATmax= 9*10^(15), approximately.2) This function is numerically very stable, despite the complicated argument reduction. The worst errors occur just above 2-p3, but are no more than 3.2 ulp.

3) It is also very well behaved with respect to errors in the argument, i.e. the error amplification factors are always small.

4) The argument reduction scheme ensures that the identities

tan^(-1)(X) = p/2 - tan^(-1)(1/X), and tan^(-1)(X) = p/6 + tan^(-1)((p3*X-1)/(p3 +X)) are preserved.

**ATAN2DATAN2**

REAL32 FUNCTION ATAN2 (VAL REAL32 X, Y)

REAL64 FUNCTION DATAN2 (VAL REAL64 X, Y)

Compute the angular co-ordinate tan^(-1)(Y/X) (in radians) of a point whoseXandYco-ordinates are given.

Domain: [-Inf, Inf] x [-Inf, Inf] Range: (-Pi, Pi] Primary Domain: See note 2.

**Exceptions**

(0, 0) and ([[formula]]Inf,[[formula]]Inf) giveundefined.NaN.

**Propagated Error**

A=X(1 [[formula]]Y)/(X^(2) +Y^(2)),R=X(1 [[formula]]Y)/(tan^(-1)(Y/X)(X^(2) +Y^(2))) (See note 3)

**Generated Error **

See note 2.

**The Algorithm**

- If
*X*, the first argument, is zero, set the result to [[formula]] p /2, according to the sign of*Y*, the second argument. - Otherwise set
*Rwork*:=**ATAN**(*Y*/*X*`)`. Then if*Y*< 0 set*RESULT*=*Rwork*-*Pi*, otherwise set*RESULT*=*Pi*-*Rwork*.

1) This two-argument function is designed to perform rectangular-to-polar co-ordinate conversion.2) See the notes for

for the primary domain and estimates of the generated error.ATAN3) The error amplification factors were derived on the assumption that the relative error in

Yis [[formula]] that inX, otherwise there would be separate factors forXandY. They are small except near the origin, where the polar co-ordinate system is singular.

**SINHDSINH**

REAL32 FUNCTION SINH (VAL REAL32 X)

REAL64 FUNCTION DSINH (VAL REAL64 X)

Compute sinh(X).

Domain: [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D Range: (-Inf, Inf) Primary Domain: (-1.0, 1.0)

**Exceptions**

X< -Hmaxgives -Inf, andX>HmaxgivesInf.

**Propagated Error**

A=Xcosh(X),R=Xcoth(X)(See note 3)

**Generated Error**

Primary Domain [1.0, XBig] (See note 2) MRE RMSRE MAE RMSAE Single Length: 0.91 ulp 0.26 ulp 1.41 ulp 0.34 ulp Double Length: 0.67 ulp 0.22 ulp 1.31 ulp 0.33 ulp

**The Algorithm**

- If |
*X*| >*XBig*, set*Rwork*:=|**EXP**(*X*| - ln(2)`)`. - If
*XBig*w |*X*| w1.0, set temp:=|**EXP**(*X*|`)`, and set*Rwork*= (*temp*- 1/*temp*)/2. - Otherwise compute sinh(|
*X*|) with a floating-point rational approximation. - In all cases, set
*RESULT*= [[formula]]*Rwork*according to the sign of*X*.

1)Hmaxis the point at which sinh(X) becomes too large to be represented in the floating-point format.2)

XBigis the point at whiche^(-)^(|X|)becomes insignificant compared withe^(|X|), (in floating-point). For single-length it is 8.32, and for double-length it is 18.37.3) This function is quite stable with respect to errors in the argument. Relative error is magnified near zero, but the absolute error is a better measure near the zero of the function and it is diminished there. For large arguments absolute errors are magnified, but since the function is itself large, relative error is a better criterion, and relative errors are not magnified unduly for any argument in the domain, although the output does become less reliable near the ends of the range.

**COSHDCOSH**

REAL32 FUNCTION COSH (VAL REAL32 X)

REAL64 FUNCTION DCOSH (VAL REAL64 X)

Compute cosh(X).

Domain: [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D Range: [1.0, Inf) Primary Domain: [-XBig, XBig]= [-8.32, 8.32]S

= [-18.37, 18.37]D

**Exceptions**

|X| >HmaxgivesInf.

**Propagated Error**

A=Xsinh(X) ,R=Xtanh(X)(See note 3)

**Generated Error**

Primary Domain Error: MRE RMS Single Length: 1.24 ulp 0.32 ulp Double Length: 1.24 ulp 0.33 ulp

**The Algorithm**

- If |
*X*| >*XBig*, set*result*:=|**EXP**(*X*| - ln(2)`)`. - Otherwise, set
*temp*:=|**EXP**(*X*|`)`, and set*result*= (*temp*+ 1/*temp*)/2.

1)Hmaxis the point at which cosh(X) becomes too large to be represented in the floating-point format.2)

XBigis the point at whiche^(-)^(|X|)becomes insignificant compared withe^(|X|)(in floating-point).3) Errors in the argument are not seriously magnified by this function, although the output does become less reliable near the ends of the range.

**TANHDTANH **

REAL32 FUNCTION TANH (VAL REAL32 X)

REAL64 FUNCTION DTANH (VAL REAL64 X)

Compute tanh(X).

Domain: [-Inf, Inf] Range: [-1.0, 1.0] Primary Domain: [-Log(3)/2, Log(3)/2] = [-0.549, 0.549]

**Exceptions**

None.

**Propagated Error**

A=X/cosh^(2)(X),R=X/sinh(X) cosh(X)

**Generated Error**

Primary Domain Error: MRE RMS Single Length: 0.53 ulp 0.2 ulp Double Length: 0.53 ulp 0.2 ulp

**The Algorithm**

- If |
*X*| > ln(3)/2, set*temp*:=|**EXP**(*X*|/2`)`. Then set*Rwork*= 1 - 2/(1+*temp*). - Otherwise compute
*Rwork*= tanh(|*X*|) with a floating-point rational approximation. - In both cases, set
*RESULT*= [[formula]]*Rwork*according to the sign of*X*.

1) As a floating-point number, tanh(X) becomes indistinguishable from its asymptotic values of [[formula]]1.0 for |X|>HTmax, whereHTmaxis 8.4 for single-length, and 19.06 for double-length. Thus the output ofis equal to [[formula]]1.0 for suchTANHX.2) This function is very stable and well-behaved, and errors in the argument are always diminished by it.

**RANDRAN**

REAL32,INT32 FUNCTION RAN (VAL INT32 X)

REAL64,INT64 FUNCTION DRAN (VAL INT64 X)

These produce a pseudo-random sequence of integers, or a corresponding sequence of floating-point numbers between zero and one. X is the seed integer that initiates the sequence.

Domain: Integers (see note 1) Range: [0.0, 1.0] x Integers

**Exceptions**

None.

**The Algorithm**

- Produce the next integer in the sequence:
*Nk+1*= (*aNk*+ 1)*mod M* - Treat
*Nk+1*as a fixed-point fraction in [0,1), and convert it to floating point. - Output the floating point result and the new integer.

1) This function has two results, the first a real, and the second an integer (both 32 bits for single-length, and 64 bits for double-length). The integer is used as the argument for the next call to, i.e. it `carries' the pseudo-random linear congruential sequenceRANNk, and it should be kept in scope for as long asis used. It should be initialized before the first call toRANRANbut not modified thereafter except by the function itself.2) If the integer parameter is initialized to the same value, the same sequence (both floating-point and integer) will be produced. If a different sequence is required for each run of a program it should be initialized to some `random' value, such as the output of a timer.

3) The integer parameter can be copied to another variable or used in expressions requiring random integers. The topmost bits are the most random. A random integer in the range [0,

L] can conveniently be produced by taking the remainder by (L+1) of the integer parameter shifted right by one bit. If the shift is not done an integer in the range [-L,L] will be produced.4) The modulus

Mis 2^(32) for single-length and 2^(64) for double-length, and the multipliers,a, have been chosen so that allMintegers will be produced before the sequence repeats. However several different integers can produce the same floating-point value and so a floating-point output may be repeated, although thesequenceof such will not be repeated untilMcalls have been made.5) The floating-point result is uniformly distributed over the output range, and the sequence passes various tests of randomness, such as the `run test', the `maximum of 5 test' and the `spectral test'.

6) The double-length version is slower to execute, but `more random' than the single-length version. If a highly-random sequence of single-length numbers is required, this could be produced by converting the output of

to single-length. Conversely if only a relatively crude sequence of double-length numbers is required,DRANcould be used for higher speed and its output converted to double-length.RAN

#USE "tbmaths.lib"

The version of the library described by this section has been written for 32-bit processors without hardware for floating-point arithmetic. Functions from it will give results very close, but not identical to, those produced by the corresponding functions from the single and double length libraries.

This is the version specifically intended to derive maximum performance from
the IMS T400, T414, T425, and T426 processors. The single-precision
functions make use of the ` FMUL` instruction available on
32-bit processors without floating-point hardware. The library is
compiled for transputer class

The tables and notes at the beginning of section 1.4 apply equally here. However all the functions are contained in one library.

DALOG

REAL32 FUNCTION ALOG (VAL REAL32 X)

REAL64 FUNCTION DALOG (VAL REAL64 X)

These compute: loge(X)

Domain: (0, XMAX] Range: [MinLog, MaxLog] (See note 2) Primary Domain: [p2/2, p2) = [0.7071, 1.4142)

**Exceptions**

All arguments outside the domain generate anundefined.NaN.

**Propagated Error**

A51,R= loge(X)

**Generated Error**

Primary Domain Error: MRE RMSRE Single Length: 1.19 ulp 0.36 ulp Double Length: 2.4 ulp 1.0 ulp

**The Algorithm**

- Split
*X*into its exponent*N*and fraction*F*. - Find the natural log of
*F*with a fixed-point rational approximation, and convert it into a floating-point number*LnF*. - Compute ln(2) *
*N*with extended precision and add it to*LnF*to get the result.

1) The term ln(2) *Nis much easier to compute (and more accurate) thanLnF, and it is larger providedNis not 0 (i.e. for arguments outside the primary domain). Thus the accuracy of the result improves as the modulus of log(X) increases.2) The minimum value that can be produced,

MinLog, is the logarithm of the smallest denormalized floating-point number. For single lengthMinLogis -103.28, and for double length it is -744.4. The maximum valueMaxLogis the logarithm ofXMAX. For single-length it is 88.72, and for double-length it is 709.78.3) Since

Infis used to representallvalues greater thanXMAXits logarithm cannot be defined.4) This function is well-behaved and does not seriously magnify errors in the argument.

**ALOG10**

DALOG10`
`

REAL32 FUNCTION ALOG10 (VAL REAL32 X)

REAL64 FUNCTION DALOG10 (VAL REAL64 X)

These compute: log10(X)

Domain: (0, XMAX] Range: [MinL10, MaxL10] (See note 2) Primary Domain: [p2/2, p2) = [0.7071, 1.4142)

**Exceptions**

All arguments outside the domain generate anundefined.NaN.

**Propagated Error**

A5 log10(e),R= log10(e)/loge(X)

**Generated Error**

Primary Domain Error: MRE RMSRE Single Length: 1.43 ulp 0.39 ulp Double Length: 2.64 ulp 0.96 ulp

**The Algorithm**

- Set
*temp*:=.**ALOG(X)** - If
*temp*is a**NaN**, copy it to the output, otherwise set

result = log(*e*) **temp*.

1) See note 1 for.ALOG2) The minimum value that can be produced,

MinL10, is the base 10 logarithm of the smallest denormalized floating-point number. For single lengthMinL10 is -44.85, and for double length it is -323.3. The maximum valueMaxL10 is the base 10 logarithm ofXMAX. For single lengthMaxL10 is 38.53, and for double-length it is 308.26.3) Since

Infis used to representallvalues greater thanXMAXits logarithm cannot be defined.4) This function is well-behaved and does not seriously magnify errors in the argument.

**EXP**

DEXP`
`

REAL32 FUNCTION EXP (VAL REAL32 X)

REAL64 FUNCTION DEXP (VAL REAL64 X)

These compute:e^(X)

Domain: [-Inf, MaxLog) = [-Inf, 88.03)S, [-Inf, 709.78)D Range: [0, Inf) (See note 4) Primary Domain: [-Ln2/2, Ln2/2) = [-0.3466, 0.3466)

**Exceptions**

All arguments outside the domain generate anInf.

**Propagated Error**

A=Xe^(X),R=X

**Generated Error**

Primary Domain Error: MRE RMSRE Single Length: 0.51 ulp 0.21 ulp Double Length: 0.5 ulp 0.21 ulp

**The Algorithm**

- Set
*N*= integer part of*X*/ln(2). - Compute the remainder of
*X*by ln(2), using extended precision arithmetic. - Convert the remainder to fixed-point, compute its exponential using a fixed-point rational function, and convert the result back to floating point.
- Increase the exponent of the result by
*N*. If*N*is sufficiently negative the result must be denormalized.

1)MaxLogis loge(XMAX).2) The analytical properties of

e^(x)make the relative error of the result proportional to the absolute error of the argument. Thus the accuracy of step 2, which prepares the argument for the rational approximation, is crucial to the performance of the subroutine. It is completely accurate whenN= 0, i.e. in the primary domain, and becomes less accurate as the magnitude ofNincreases. SinceNcan attain larger negative values than positive ones,is least accurate for large, negative arguments.EXP3) For sufficiently negative arguments (below -87.34 for single-length and below -708.4 for double-length) the output is denormalized, and so the floating-point number contains progressively fewer significant digits, which degrades the accuracy. In such cases the error can theoretically be a factor of two.

4) Although the true exponential function is never zero, for large negative arguments the true result becomes too small to be represented as a floating-point number, and

underflows to zero. This occurs for arguments below -103.9 for single-length, and below -745.2 for double-length.EXP5) The propagated error is considerably magnified for large positive arguments, but diminished for large negative arguments.

**POWER**

DPOWER`
`

REAL32 FUNCTION POWER (VAL REAL32 X, Y)

REAL32 FUNCTION DPOWER (VAL REAL64 X, Y)

These compute:X^(Y)

Domain: [0, Inf] x [-Inf, Inf] Range: (-Inf, Inf) Primary Domain: See note 3.

**Exceptions**

If the first argument is outside its domain,undefined.NaNis returned. If the true value ofX^(Y)exceedsXMAX,Infis returned. In certain other cases otherNaNs are produced: See note 2.

**Propagated Error**

A=YX^(Y)(1 [[formula]] loge(X)),R=Y(1 [[formula]] loge(X)) (See note 4)

**Generated Error**

Example Range Error: MRE RMSRE (See note 3) Single Length: 1.0 ulp 0.24 ulp Double Length: 13.2 ulp 1.73 ulp

**The Algorithm**

Deal with special cases: either argument = 1, 0, +Infor -Inf(see note 2). Otherwise:(a) For single precision:

- Compute
*L*= log2(*X*) in fixed point, where*X*is the first argument. - Compute
*W*=*Y*x*L*in double precision, where*Y*is the second argument. - Compute 2^(
*W*) in fixed point and convert to floating-point result.

(b) For double precision:

- Compute
*L*= log2(*X*) in extended precision, where*X*is the first argument. - Compute
*W*=*Y*x*L*in extended precision, where*Y*is the second argument. - Compute
*RESULT*= 2^(*W*) in extended precision.

1) This subroutine implements the mathematical functionx^(y)to a much greater accuracy than can be attained using theandALOGfunctions, by performing each step in higher precision.EXP2) Results for special cases are as follows:

First Input (X) Second Input (Y) Result t0 ANY undefined.NaN 0 v0 undefined.NaN 0 0 <Y<=XMAX 0 0 Inf unstable.NaN 0 t X t1 Inf 0 0 t X t1 -Inf Inf 1 -XMAX<=Y<=XMAX 1 1 " Inf unstable.NaN 1 t X v XMAX Inf Inf 1 t X v XMAX -Inf 0 Inf 1 v Y v Inf Inf Inf -Inf vY v -1 0 Inf -1 t Y t1 undefined.NaN otherwise 0 1 otherwise 1 X

3) Performing all the calculations in extended precision makes the double-precision algorithm very complex in detail, and having two arguments makes a primary domain difficult to specify. As an indication of accuracy, the functions were evaluated at 100,000 points logarithmically distributed over (0.1, 10.0), with the exponent linearly distributed over (-35.0, 35.0) (single-length), and (-300.0, 300.0) (double-length), producing the errors given above. The errors are much smaller if the exponent range is reduced.4) The error amplification factors are calculated on the assumption that the relative error in

Yis [[formula]] that inX, otherwise there would be separate factors for bothXandY. It can be seen that the propagated error will be greatly amplified whenever loge(X) orYis large.

**The Algorithm**

- Compute
*L*= log2(*X*) in fixed point, where*X*is the first argument. - Compute
*W*=*Y*x*L*in double precision, where*Y*is the second argument. - Compute 2^(
*W*) in fixed point and convert to floating-point result. - Compute
*L*= log2(*X*) in extended precision, where*X*is the first argument. - Compute
*W*=*Y*x*L*in extended precision, where*Y*is the second argument. - Compute
*RESULT*= 2^(*W*) in extended precision.

DSIN

REAL32 FUNCTION SIN (VAL REAL32 X)

REAL64 FUNCTION DSIN (VAL REAL64 X)

These compute: sine(X)(whereXis in radians)

Domain: [-Smax, Smax] = [-12868.0, 12868.0]S,

= [-2.1*108, 2.1*108]D Range: [-1.0, 1.0] Primary Domain: [-Pi/2, Pi/2] = [-1.57, 1.57]

**Exceptions**

All arguments outside the domain generate aninexact.NaN, except [[formula]]Inf, which generates anundefined.NaN.

Propagated Error

A=Xcos(X),R=Xcot(X)

Generated Error(See note 3)

Range: Primary Domain [0, 2Pi] MRE RMSRE MAE RMSAE Single Length: 0.65 ulp 0.22 ulp 0.74 ulp 0.18 ulp Double Length: 0.56 ulp 0.21 ulp 0.64 ulp 0.16 ulp

**The Algorithm**

- Set
*N*= integer part of |*X*|/*Pi*. - Compute the remainder of |
*X*| by*Pi*, using extended precision arithmetic. - Convert the remainder to fixed-point, compute its sine using a fixed-point rational function, and convert the result back to floating point.
- Adjust the sign of the result according to the sign of the argument and
the evenness of
*N*.

1) For arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extended precision corresponds toKextra bits in the representation of p (K= 8 for single-length and 12 for double-length). If the argument agrees with an integer multiple of p to more thanKbits there is a loss of significant bits in the remainder, equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.2) The extra precision of step 2 is lost if

Nbecomes too large, and the cut-offSmaxis chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger thanSmaxa change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.3) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function (outside the primary range), but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.

4) Since only the remainder of

XbyPiis used in step 3, the symmetry sin(x+np ) = [[formula]] sin(x) is preserved, although there is a complication due to differing precision representations of p.5) The output range is not exceeded. Thus the output of

is always a valid argument forSIN.ASIN

**COS**

DCOS`
`

REAL32 FUNCTION COS (VAL REAL32 X)

REAL64 FUNCTION DCOS (VAL REAL64 X)

These compute: cosine (X)(whereXis in radians)

Domain: [-Smax, Smax] = [-12868.0, 12868.0]S,

= [-2.1*108, 2.1*108]D Range: [-1.0, 1.0] Primary Domain: See note 1.

**Exceptions**

All arguments outside the domain generate aninexact.NaN, except [[formula]]Inf, which generates anundefined.NaN.

**Propagated Error**

A= -Xsin(X),R= -Xtan(X)(See note 4)

**Generated Error**

Range: [0,Pi/4) [0, 2Pi] MRE RMSRE MAE RMSAE Single Length: 1.0 ulp 0.28 ulp 0.81 ulp 0.17 ulp Double Length: 0.93 ulp 0.26 ulp 0.76 ulp 0.18 ulp

**The Algorithm**

- Set
*N*= integer part of (|*X*|+*Pi*/2)/*Pi*. - Compute the remainder of (|
*X*|+*Pi*/2) by*Pi*, using extended precision arithmetic. - Compute the remainder to fixed-point, compute its sine using a fixed-point rational function, and convert the result back to floating point.
- Adjust the sign of the result according to the evenness of
*N*.

1) Inspection of the algorithm shows that argument reduction always occurs, thus there is no `primary domain' for. So for all arguments the accuracy of the result depends crucially on step 2. The extended precision corresponds toCOSKextra bits in the representation of p (K= 8 for single-length and 12 for double length). If the argument agrees with an odd integer multiple of p/2 to more thanKbits there is a loss of significant bits in the remainder, equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.2) The extra precision of step 2 is lost if

Nbecomes too large, and the cut-offSmaxis chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger thanSmaxa change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.3) For small arguments the errors are not evenly distributed. As the argument becomes smaller there is an increasing bias towards negative errors (which is to be expected from the form of the Taylor series). For the single-length version and

Xin [-0.1, 0.1], 62% of the errors are negative, whilst forXin [-0.01, 0.01], 70% of them are.4) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function, but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.

5) Since only the remainder of (|

X|+Pi/2) byPiis used in step 3, the symmetry cos(x+np ) = [[formula]] cos(x) is preserved. Moreover, since the same rational approximation is used as in, the relation cos(SINx) =sin(x+ p/2) is also preserved. However, in each case there is a complication due to the different precision representations of p.6) The output range is not exceeded. Thus the output of

is always a valid argument forCOS.ACOS

**TAN**

DTAN`
`

REAL32 FUNCTION TAN (VAL REAL32 X)

REAL64 FUNCTION DTAN (VAL REAL64 X)

These compute: tan(X)(whereXis in radians)

Domain: [-Tmax, Tmax] = [-6434.0, 6434.0]S = [-1.05*108, 1.05*108]D Range: (-Inf, Inf) Primary Domain: [-Pi/4, Pi/4] = [-0.785, 0.785]

**Exceptions**

All arguments outside the domain generate aninexact.NaN, except [[formula]]Inf, which generate anundefined.NaN. Odd integer multiples of p/2 may produceunstable.NaN.

**Propagated Error**

A=X(1+ tan^(2)(X)),R=X(1+ tan^(2)(X))/tan(X)(See note 4)

**Generated Error**

Primary Domain Error: MRE RMSRE Single Length: 3.5 ulp 0.23 ulp Double Length: 0.69 ulp 0.23 ulp

**The Algorithm**

- Set
*N*= integer part of*X*/(*Pi*/2). - Compute the remainder of
*X*by*Pi*/2, using extended precision arithmetic. - Convert the remainder to fixed-point, compute its tangent using a fixed-point rational function, and convert the result back to floating point.
- If
*N*is odd, take the reciprocal. - Set the sign of the result according to the sign of the argument.

1)Ris large wheneverXis near to an integer multiple of p/2, and so tan is very sensitive to small errors near its zeros and singularities. Thus for arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extended precision corresponds toKextra bits in the representation of p/2 (K= 8 for single-length and 12 for double-length). If the argument agrees with an integer multiple of p/2 to more thanKbits there is a loss of significant bits in the remainder, approximately equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.2) The extra precision of step 2 is lost if

Nbecomes too large, and the cut-offTmaxis chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger thanTmaxa change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.3) Step 3 of the algorithm has been slightly modified in the double-precision version from that given in Cody & Waite to avoid fixed-point underflow in the polynomial evaluation for small arguments.

4) Tan is quite badly behaved with respect to errors in the argument. Near its zeros outside the primary domain the relative error is greatly magnified, though the absolute error is only proportional to the size of the argument. In effect, the error is seriously amplified in an interval about each irrational zero, whose width increases roughly in proportion to the size of the argument. Near its singularities both absolute and relative errors become large, so any large output from this function is liable to be seriously contaminated with error, and the larger the argument, the smaller the maximum output which can be trusted. If step 4 of the algorithm requires division by zero, an

unstable.NaNis produced instead.5) Since only the remainder of

XbyPi/2 is used in step 3, the symmetry tan(x+np) = tan(x) is preserved, although there is a complication due to the differing precision representations of p. Moreover, by step 4 the symmetry tan(x) = 1/ tan( p/2 -x) is also preserved.

**ASIN**

DASIN`
`

REAL32 FUNCTION ASIN (VAL REAL32 X)

REAL64 FUNCTION DASIN (VAL REAL64 X)

These compute: sin^(-1)(X)(in radians)

Domain: [-1.0, 1.0] Range: [-Pi/2, Pi/2] Primary Domain: [-0.5, 0.5]

**Exceptions**

All arguments outside the domain generate anundefined.NaN.

**Propagated Error**

A=X/p1 -X^(2),R=X/(sin^(-1)(X) p1 -X^(2))

**Generated Error**

Primary Domain [-1.0, 1.0] MRE RMSRE MAE RMSAE Single Length: 0.53 ulp 0.21 ulp 1.35 ulp 0.33 ulp Double Length: 2.8 ulp 1.4 ulp 2.34 ulp 0.64 ulp

**The Algorithm**

- If |
*X*|> 0.5, set*Xwork*:=(1 - |**SQRT**(*X*|)/2`)`.

Compute*Rwork*= arcsine(-2 **Xwork*) with a floating-point rational approximation, and set the result =*Rwork*+*Pi*/2. - Otherwise compute the result directly using the rational approximation.
- In either case set the sign of the result according to the sign of the argument.

1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error: however since both domain and range are bounded theabsoluteerror in the result cannot be large.2) By step 1, the identity sin^(-1)(

x) = p/2 - 2 sin^(-1)(p(1-x)/2)) is preserved.

**ACOS**

DACOS`
`

REAL32 FUNCTION ACOS (VAL REAL32 X)

REAL64 FUNCTION DACOS (VAL REAL64 X)

These compute: cosine^(-1)(X)(in radians)

Domain: [-1.0, 1.0] Range: [0, Pi] Primary Domain: [-0.5, 0.5]

**Exceptions**

All arguments outside the domain generate anundefined.NaN.

**Propagated Error**

A= -X/p1 -X^(2),R= -X/(sin^(-1)(X) p1 -X^(2))

**Generated Error**

Primary Domain [-1.0, 1.0] MRE RMSRE MAE RMSAE Single Length: 1.1 ulp 0.38 ulp 2.4 ulp 0.61 ulp Double Length: 1.3 ulp 0.34 ulp 2.9 ulp 0.78 ulp

**The Algorithm**

- If |
*X*| > 0.5, set*Xwork*:=(1- |**SQRT**(*X*|)/2`).`Compute*Rwork*= arcsine (2 **Xwork*) with a floating-point rational approximation. If the argument was positive, this is the result, otherwise set the result =*Pi*-*Rwork*. - Otherwise compute
*Rwork*directly using the rational approximation. If the argument was positive, set result =*Pi*/2 -*Rwork*, otherwise result =*Pi*/2 +*Rwork*.

1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error, although this interval is larger near 1 than near -1, since the function goes to zero with an infinite derivative there. However since both the domain and range are bounded theabsoluteerror in the result cannot be large.2) Since the rational approximation is the same as that in

, the relation cos^(-1)(ASINx) = p/2 - sin^(-1)(x)is preserved.

**ATAN**

DATAN`
`

REAL32 FUNCTION ATAN (VAL REAL32 X)

REAL64 FUNCTION DATAN (VAL REAL64 X)

These compute: tan^(-1)(X)(in radians)

Domain: [-Inf, Inf] Range: [-Pi/2, Pi/2] Primary Domain: [-z, z], z = 2 - p3 = 0.2679

**Exceptions**

None.

**Propagated Error**

A=X/(1 +X^(2)),R=X/(tan^(-1)(X)(1 +X^(2)))

**Generated Error**

Primary Domain Error: MRE RMSRE Single Length: 0.53 ulp 0.21 ulp Double Length: 1.27 ulp 0.52 ulp

**The Algorithm**

- If |
*X*| > 1.0, set*Xwork*= 1/|*X*|, otherwise*Xwork*= |*X*|. - If
*Xwork*> 2-p3, set*F*= (*Xwork**p3 -1)/(*Xwork*+p3), otherwise*F*=*Xwork*. - Compute
*Rwork*= arctan(*F*) with a floating-point rational approximation. - If
*Xwork*was reduced in (2), set*R*=*Pi*/6 +*Rwork*, otherwise*R*=*Rwork*. - If
*X*was reduced in (1), set*RESULT*=*Pi*/2 -*R*, otherwise*RESULT*=*R*. - Set the sign of the
*RESULT*according to the sign of the argument.

1) For |X| >ATmax, |tan^(-1)(X)| is indistinguishable from p/2 in the floating-point format. For single-length,ATmax= 1.68*10^(7), and for double-lengthATmax= 9*10^(15), approximately.2) This function is numerically very stable, despite the complicated argument reduction. The worst errors occur just above 2-p3, but are no more than 1.8 ulp. 3) It is also very well behaved with respect to errors in the argument, i.e. the error amplification factors are always small.

4) The argument reduction scheme ensures that the identities

tan^(-1)(X) = p/2 - tan^(-1)(1/X), and tan^(-1)(X) = p/6 + tan^(-1)((p3*X-1)/(p3 +X)) are preserved.

**ATAN2**

DATAN2`
`

REAL32 FUNCTION ATAN2 (VAL REAL32 X, Y)

REAL64 FUNCTION DATAN2 (VAL REAL64 X, Y)

These compute the angular co-ordinate tan^(-1)(Y/X) (in radians) of a point whoseXandYco-ordinates are given.

Domain: [-Inf, Inf] x [-Inf, Inf] Range: (-Pi, Pi] Primary Domain: See note 2.

**Exceptions**

(0, 0) and ([[formula]]Inf,[[formula]]Inf) giveundefined.NaN.

**Propagated Error**

A=X(1 [[formula]]Y)/(X^(2) +Y^(2)),R=X(1 [[formula]]Y)/(tan^(-1)(Y/X)(X^(2) +Y^(2))) (See note 3)

**Generated Error**

See note 2.

**The Algorithm**

- If
*X*, the first argument, is zero, set the result to [[formula]] p/2, according to the sign of*Y*, the second argument. - Otherwise set
*Rwork*:=**ATAN**(*Y*/*X*`).`Then if*Y*< 0 set*RESULT*=*Rwork*-*Pi*, otherwise set*RESULT*=*Pi*-*Rwork*.

1) This two-argument function is designed to perform rectangular-to-polar co-ordinate conversion.2) See the notes for

for the primary domain and estimates of the generated error.ATAN3) The error amplification factors were derived on the assumption that the relative error in

Yis [[formula]] that inX, otherwise there would be separate factors forXandY. They are small except near the origin, where the polar co-ordinate system is singular.

**SINH**

DSINH`
`

REAL32 FUNCTION SINH (VAL REAL32 X)

REAL64 FUNCTION DSINH (VAL REAL64 X)

These compute: sinh(X)

Domain:[-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D Range: (-Inf, Inf) Primary Domain: (-1.0, 1.0)

**Exceptions**

X< -Hmaxgives -Inf, andX>HmaxgivesInf.

**Propagated Error**

A=Xcosh(X),R=Xcoth(X)(See note 3)

**Generated Error**

Primary Domain [1.0, XBig]

(See note 2) MRE RMSRE MAE RMSAE Single Length: 0.89 ulp 0.3 ulp 0.98 ulp 0.31 ulp Double Length: 1.3 ulp 0.51 ulp 1.0 ulp 0.3 ulp

**The Algorithm**

- If |
*X*| >*XBig*, set*Rwork*:=|**EXP**(*X*| - ln(2)`)`. - If
*XBig*w |*X*| w 1.0, set*temp*:=|**EXP**(*X*|`)`, and set*Rwork*= (*temp*- 1/*temp*)/2. - Otherwise compute
*Rwork*= sinh(|*X*|) with a fixed-point rational approximation. - In all cases, set
*RESULT*= [[formula]]*Rwork*according to the sign of*X*.

1)Hmaxis the point at which sinh(X) becomes too large to be represented in the floating-point format.2)

XBigis the point at whiche^(-)^(|X|)becomes insignificant compared withe^(|X|), (in floating-point). For single-length it is 8.32, and for double-length it is 18.37.3) This function is quite stable with respect to errors in the argument. Relative error is magnified near zero, but the absolute error is a better measure near the zero of the function and it is diminished there. For large arguments absolute errors are magnified, but since the function is itself large, relative error is a better criterion, and relative errors are not magnified unduly for any argument in the domain, although the output does become less reliable near the ends of the range.

**COSH**

DCOSH`
`

REAL32 FUNCTION COSH (VAL REAL32 X)

REAL64 FUNCTION DCOSH (VAL REAL64 X)

These compute: cosh(X)

Domain: [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D Range: [1.0, Inf) PrimaryDomain: [-XBig, XBig] = [-8.32, 8.32]S

= [-18.37, 18.37]D

**Exceptions**

|X| >HmaxgivesInf.

**Propagated Error**

A=Xsinh(X) ,R=Xtanh(X)(See note 3)

**Generated Error**

Primary Domain Error: MRE RMS Single Length: 0.99 ulp 0.3 ulp Double Length: 1.23 ulp 0.3 ulp

**The Algorithm**

- If |
*X*| >*XBig*, set*result*:=|**EXP**(*X*| - ln(2)`)`. - Otherwise, set
*temp*:=|**EXP**(*X*|`)`, and set*result*= (*temp*+ 1/*temp*)/2.

1)Hmaxis the point at which cosh(X) becomes too large to be represented in the floating-point format.2)

XBigis the point at whiche^(-)^(|X|)becomes insignificant compared withe^(|X|) (in floating-point).3) Errors in the argument are not seriously magnified by this function, although the output does become less reliable near the ends of the range.

**TANH**

DTANH`
`

REAL32 FUNCTION TANH (VAL REAL32 X)

REAL64 FUNCTION DTANH (VAL REAL64 X)

These compute: tanh(X)

Domain: [-Inf, Inf] Range: [-1.0, 1.0] Primary Domain: [-Log(3)/2, Log(3)/2] = [-0.549, 0.549]

**Exceptions**

None.

**Propagated Error**

A=X/cosh^(2)(X),R=X/sinh (X) cosh(X)

**Generated Error**

Primary Domain Error: MRE RMS Single Length: 0.52 ulp 0.2 ulp Double Length: 4.6 ulp 2.6 ulp

**The Algorithm**

- If |
*X*| > ln(3)/2, set*temp*:=|**EXP**(*X*|/2`)`. Then set*Rwork*= 1 - 2/(1+*temp*). - Otherwise compute
*Rwork*= tanh(|*X*|) with a floating-point rational approximation. - In both cases, set
*RESULT*= [[formula]]*Rwork*according to the sign of*X*.

1) As a floating-point number, tanh(X) becomes indistinguishable from its asymptotic values of [[formula]]1.0 for |X| >HTmax, whereHTmaxis 8.4 for single-length, and 19.06 for double-length. Thus the output ofis equal to [[formula]]1.0 for suchTANHX.2) This function is very stable and well-behaved, and errors in the argument are always diminished by it.

**RAN**

DRAN`
`

REAL32,INT32 FUNCTION RAN (VAL INT32 X)

REAL64,INT64 FUNCTION DRAN (VAL INT64 X)

These produce a pseudo-random sequence of integers, and a corresponding sequence of floating-point numbers between zero and one.

Domain: Integers (see note 1) Range: [0.0, 1.0] x Integers

**Exceptions**

None.

**The Algorithm**

- Produce the next integer in the sequence:
*Nk+1*= (*aNk*+ 1)*mod M* - Treat
*Nk+1*as a fixed-point fraction in [0,1), and convert it to floating point. - Output the floating point result and the new integer.

1) This function has two results, the first a real, and the second an integer (both 32 bits for single-length, and 64 bits for double-length). The integer is used as the argument for the next call to, i.e. it `carries' the pseudo-random linear congruential sequenceRANNk, and it should be kept in scope for as long asis used. It should be initialized before the first call toRANbut not modified thereafter except by the function itself.RAN2) If the integer parameter is initialized to the same value, the same sequence (both floating-point and integer) will be produced. If a different sequence is required for each run of a program it should be initialized to some `random' value, such as the output of a timer.

3) The integer parameter can be copied to another variable or used in expressions requiring random integers. The topmost bits are the most random. A random integer in the range [0,

L] can conveniently be produced by taking the remainder by (L+1) of the integer parameter shifted right by one bit. If the shift is not done an integer in the range [-L,L] will be produced.4) The modulus

Mis 2^(32) for single-length and 2^(64) for double-length, and the multipliers,a, have been chosen so that allMintegers will be produced before the sequence repeats. However several different integers can produce the same floating-point value and so a floating-point output may be repeated, although thesequenceof such will not be repeated untilMcalls have been made.5) The floating-point result is uniformly distributed over the output range, and the sequence passes various tests of randomness, such as the `run test', the `maximum of 5 test' and the `spectral test'.

6) The double-length version is slower to execute, but `more random' than the single-length version. If a highly-random sequence of single-length numbers is required, this could be produced by converting the output of

to single-length. Conversely if only a relatively crude sequence of double-length numbers is required,DRANcould be used for higher speed and its output converted to double-length.RAN