These files have been made freely available by SGS-Thomson and may not be used to generate commercial products without explicit permission and agreed licensing terms OR placed in a public archive or given to third parties without explicit written permission from SGS-Thomson in Bristol.
Tony Debling, SGS-Thomson Microelectronics Ltd, 1000 Aztec West, Almondsbury, Bristol BS12 4SQ, England. June 7, 1995.
Converted to HTML via RTFtoHTML, PERL5 and by hand by Dave Beckett <D.J.Beckett@ukc.ac.uk>.
WARNING: This has been converted from RTF format and the formulas and expressions have not been checked with the printed version. If you find any errors in this file, please send them to me, and I will update it.
Dave Beckett <D.J.Beckett@ukc.ac.uk>
Elementary maths and trigonometric functions are provided in three libraries, as follows:
Library Description snglmath.lib Single length library dblmath.lib Double length library tbmaths.lib TB optimized library
The single and double length libraries contain the same set of maths functions in single and double length forms. The double length forms all begin with the letter `D'. All function names are in upper case.
The TB optimized library is a combined single and double length library containing functions for the T4 series (T400, T414, T425, and T426). The functions have been optimized for speed. The standard single or double length libraries can be used on T4 processors but optimum performance will be achieved by using the TB optimized library. The accuracy of the T400/T414/T425/T426 optimized functions is similar to that of the standard single length functions but results returned may not be identical because different algorithms are used. If the optimized library is used in code compiled for any processor except a T400, T414, T425, or T426, the compiler reports an error.
To obtain the best possible speed performance with the occam maths functions use the following strategy:
The elementary function library is also described in appendix N of the occam 2 Reference manual.
These function subroutines have been written to be compatible with the ANSI standard for binary floating-point arithmetic (ANSI IEEE std 754-1985), as implemented in occam. They are based on the algorithms in:
Cody, W. J., and Waite, W. M. [1980]. Software Manual for the Elementary Functions. Prentice-Hall, New Jersey.
The only exceptions are the pseudo-random number generators, which are based on algorithms in:
Knuth, D. E. [1981]. The Art of Computer Programming, 2nd. edition, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, Mass.
Error conditions are reported by means of three distinct NaNs:
The implementations will return the following values for these Not-a-Numbers:
Error Single length value Double length value undefined.NaN #7F800010 #7FF00002 00000000 unstable.NaN #7F800008 #7FF00001 00000000 inexact.NaN #7F800004 #7FF00000 80000000
For most of the functions the quoted error is for arguments in the primary domain, which represents the basic accuracy of the approximation. For some functions the process of range reduction results in a higher accuracy for arguments outside the primary domain, and for others it does the reverse. Refer to the notes on each function for more details.
If f is the mathematical function and F the subroutine approximation, then the relative error at the floating-point number X (provided f(X) is not zero) is:
Obviously the relative error may become very large near a zero of f(X). If the zero is at an irrational argument (which cannot be represented as a floating-point value), the absolute error is a better measure of the accuracy of the function near the zero.
As it is impractical to find the relative error for every possible argument, statistical measures of the overall error must be used. If the relative error is sampled at a number of points Xn (n = 1 to N), then useful statistics are the maximum relative error and the root-mean-square relative error:
Corresponding statistics can be formed for the absolute error also, and are called MAE and RMSAE respectively.
The MRE generally occurs near a zero of the function, especially if the true zero is irrational, or near a singularity where the result is large, since the `granularity' of the floating-point numbers then becomes significant.
A useful unit of relative error is the relative magnitude of the least significant bit in the floating-point fraction, which is called one `unit in the last place' (ulp), (i.e. the smallest e such that 1+e 01). Its magnitude depends on the floating-point format: for single-length it is 2^(-23) = 1.19*10^(-7), and for double-length it is 2^(-52 )= 2.22*10^(-16).
If the relative error in the argument X is d (typically a few ulp), then the absolute error (E) and relative error (e) in f(X) are:
This defines the absolute and relative error magnification factors A and R. When both are large the function is unstable, i.e. even a small error in the argument, such as would be produced by evaluating a floating-point expression, will cause a large error in the value of the function. The functions return an unstable.NaN in such cases which are simple to detect.
The functional forms of both A and R are given in the specification of each function.
In both cases the higher-precision implementation was used to approximate the mathematical function (called f above) in the computation of the error, which was evaluated in the higher precision to avoid rounding errors. Error statistics were produced according to the formulae above.
All single length functions except RAN return a single result of type REAL32, and all except RAN, POWER and ATAN2 have one parameter, a VAL REAL32.
POWER and ATAN2 have two parameters which are VAL REAL32s for the two arguments of each function.
RAN returns two results, of types REAL32 and INT32, and has one parameter which is a VAL INT32.
In each case the double-length version of name is called Dname, returns a REAL64 (except DRAN, which returns REAL64, INT64), and has parameters of type VAL REAL64 (VAL INT64 for DRAN).
For the standard single-length format it is 2^(-23) = 1.19*10^(-7).
For the double-length format it is 2^(-52) = 2.22*10^(-16).
This is also used as a measure of absolute error, since such errors can be considered `relative' to unity.
Where the range depends on the floating-point format, single-length is indicated with an S and double-length with a D.
For functions with two arguments the complete range of both arguments is given. This means that for each number in one range, there is at least one (though sometimes only one) number in the other range such that the pair of arguments is valid. Both ranges are shown, linked by an `x'.
Pi means the closest floating-point representation of the transcendental number p, ln(2) the closest representation of log e(2), and so on.
In describing the algorithms, `X' is used generically to designate the argument, and `result' (or RESULT, in the style of occam functions) to designate the output.
These two libraries will be efficient on processors with fast floating-point arithmetic and good support for the floating-point predefined functions such as MULBY2 and ARGUMENT.REDUCE. For 32-bit processors without special hardware for floating-point calculations the alternative optimized library described in section 1.4.3 using fixed-point arithmetic will be faster, but will not give identical results.
A special version has been produced for 16-bit transputers, which avoids the use of any double-precision arithmetic in the single precision functions. This is distinguished in the notes by the annotation `T2 special'; notes relating to the version for T8 and TB are denoted by `standard'.
Single and double length maths functions are listed below. Descriptions of the functions can be found in succeeding sections.
To use the single length library a program header must include the line
#USE "snglmath.lib"
To use the double length library a program header must include the line
#USE "dblmath.lib"
Result(s) Function Parameter specifiers REAL32 ALOG VAL REAL32 X REAL32 ALOG10 VAL REAL32 X REAL32 EXP VAL REAL32 X REAL32 POWER VAL REAL32 X, VAL REAL32 Y REAL32 SIN VAL REAL32 X REAL32 COS VAL REAL32 X REAL32 TAN VAL REAL32 X REAL32 ASIN VAL REAL32 X REAL32 ACOS VAL REAL32 X REAL32 ATAN VAL REAL32 X REAL32 ATAN2 VAL REAL32 X, VAL REAL32 Y REAL32 SINH VAL REAL32 X REAL32 COSH VAL REAL32 X REAL32 TANH VAL REAL32 X REAL32,INT32 RAN VAL INT32 X REAL64 DALOG VAL REAL64 X REAL64 DALOG10 VAL REAL64 X REAL64 DEXP VAL REAL64 X REAL64 DPOWER VAL REAL64 X, VAL REAL64 Y REAL64 DSIN VAL REAL64 X REAL64 DCOS VAL REAL64 X REAL64 DTAN VAL REAL64 X REAL64 DASIN VAL REAL64 X REAL64 DACOS VAL REAL64 X REAL64 DATAN VAL REAL64 X REAL64 DATAN2 VAL REAL64 X, VAL REAL64 Y REAL64 DSINH VAL REAL64 X REAL64 DCOSH VAL REAL64 X REAL64 DTANH VAL REAL64 X REAL64,INT64 DRAN VAL INT64 X
REAL32 FUNCTION ALOG (VAL REAL32 X)
REAL64 FUNCTION DALOG (VAL REAL64 X)
Compute loge(X).
Domain: (0, XMAX] Range: [MinLog, MaxLog] [MinLog, MaxLog] (See note 2) Primary Domain: [p2/2, p2) = [0.7071, 1.4142)
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A 5 1, R +1/loge(X)
Generated Error
Primary Domain Error: MRE RMSRE Single Length(Standard): 1.7 ulp 0.43 ulp Single Length(T2 special): 1.6 ulp 0.42 ulp Double Length: 1.4 ulp 0.38 ulp
The Algorithm
Notes
1) The term ln(2) * N is much easier to compute (and more accurate) than LnF, and it is larger provided N is not 0 (i.e. for arguments outside the primary domain). Thus the accuracy of the result improves as the modulus of log(X) increases.2) The minimum value that can be produced, MinLog, is the logarithm of the smallest denormalized floating-point number. For single length MinLog is -103.28, and for double length it is -744.4. The maximum value MaxLog is the logarithm of XMAX. For single-length it is 88.72, and for double-length it is 709.78.
3) Since Inf is used to represent all values greater than XMAX its logarithm cannot be defined.
4) This function is well-behaved and does not seriously magnify errors in the argument.
ALOG10
DALOG10
REAL32 FUNCTION ALOG10 (VAL REAL32 X)
REAL64 FUNCTION DALOG10 (VAL REAL64 X)
Compute log10(X).
Domain: (0, XMAX] Range: [MinL10, MaxL10] (See note 2) Primary Domain: [p2/2, p2) = [0.7071, 1.4142)
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A == log10(e), R = log10(e)/loge(X)
Generated Error
Primary Domain Error: MRE RMSRE Single Length (Standard): 1.70 ulp 0.45 ulp Single Length (T2 special): 1.71 ulp 0.46 ulp Double Length: 1.84 ulp 0.45 ulp
The Algorithm
1) See note 1 for ALOG.2) The minimum value that can be produced, MinL10, is the base 10 logarithm of the smallest denormalized floating-point number. For single length MinL10 is -44.85, and for double length it is -323.3. The maximum value MaxL10 is the base 10 logarithm of XMAX. For single length MaxL10 is 38.53, and for double-length it is 308.26.
3) Since Inf is used to represent all values greater than XMAX its logarithm cannot be defined.
4) This function is well-behaved and does not seriously magnify errors in the argument.
EXP
DEXP
REAL32 FUNCTION EXP (VAL REAL32 X)
REAL64 FUNCTION DEXP (VAL REAL64 X)
Compute e^(X).
Domain: [-Inf, MaxLog) = [-Inf, 88.72)S, [-Inf, 709.78)D Range: [0, Inf) (See note 4) Primary Domain: [-Ln2/2, Ln2/2) = [-0.3466, 0.3466)
Exceptions
All arguments outside the domain generate an Inf.
Propagated error
A = Xe^(X), R = X
Generated error
Primary Domain Error: MRE RMSRE Single Length(Standard): 0.99 ulp 0.25 ulp Single Length(T2 special): 1.0 ulp 0.25 ulp Double Length: 1.4 ulp 0.25 ulp
The Algorithm
1) MaxLog is log e(XMAX).2) For sufficiently negative arguments (below -87.34 for single-length and below -708.4 for double-length) the output is denormalized, and so the floating-point number contains progressively fewer significant digits, which degrades the accuracy. In such cases the error can theoretically be a factor of two.
3) Although the true exponential function is never zero, for large negative arguments the true result becomes too small to be represented as a floating-point number, and EXP underflows to zero. This occurs for arguments below -103.9 for single-length, and below -745.2 for double-length.
4) The propagated error is considerably magnified for large positive arguments, but diminished for large negative arguments.
POWER
DPOWER
REAL32 FUNCTION POWER (VAL REAL32 X, Y)
REAL64 FUNCTION DPOWER (VAL REAL64 X, Y)
Compute X^(Y).
Domain: [0, Inf] x [-Inf, Inf] Range: (-Inf, Inf) Primary Domain: See note 3.
Exceptions
If the first argument is outside its domain, undefined.NaN is returned. If the true value of X^(Y) exceeds XMAX, Inf is returned. In certain other cases other NaNs are produced: See note 2.
Propagated Error
A = YX^(Y)(1 [[formula]] log e(X)), R = Y(1 [[formula]] log e(X)) (See note 4)
Generated error
Example Range Error: MRE RMSRE (See note 3) Single Length(Standard): 1.0 ulp 0.25 ulp Single Length(T2 special): 63.1 ulp 13.9 ulp Double Length: 21.1 ulp 2.4 ulp
The Algorithm
Deal with special cases: either argument = 1, 0, +Inf or -Inf (see note 2). Otherwise:(a) For the standard single precision:
(b) For double precision, and the single precision special version:
1) This subroutine implements the mathematical function x^(y) to a much greater accuracy than can be attained using the ALOG and EXP functions, by performing each step in higher precision. The single-precision version is more efficient than using DALOG and EXP because redundant tests are omitted.2) Results for special cases are as follows:
First Input (X) Second Input (Y) Result t0 ANY undefined.NaN 0 v0 undefined.NaN 0 0 <Y<=XMAX 0 0 Inf unstable.NaN 0 t X t1 Inf 0 0 t X t1 -Inf Inf 1 -XMAX<=Y<=XMAX 1 1 " Inf unstable.NaN 1 t X v XMAX Inf Inf 1 t X v XMAX -Inf 0 Inf 1 v Y v Inf Inf Inf -Inf vY v -1 0 Inf -1 t Y t1 undefined.NaN otherwise 0 1 otherwise 1 X
3) Performing all the calculations in extended precision makes the double-precision algorithm very complex in detail, and having two arguments makes a primary domain difficult to specify. As an indication of accuracy, the functions were evaluated at 100,000 points logarithmically distributed over (0.1, 10.0), with the exponent linearly distributed over (-35.0, 35.0) (single-length), and (-300.0, 300.0) (double-length), producing the errors given above. The errors are much smaller if the exponent range is reduced.4) The error amplification factors are calculated on the assumption that the relative error in Y is [[formula]] that in X, otherwise there would be separate factors for both X and Y. It can be seen that the propagated error will be greatly amplified whenever loge(X) or Y is large.
SIN
DSIN
REAL32 FUNCTION SIN (VAL REAL32 X)
REAL64 FUNCTION DSIN (VAL REAL64 X)
Compute sine(X) (where X is in radians).
Domain: [-Smax, Smax] = [-205887.4, 205887.4]S (Standard),
= [-4.2*106, 4.2*106]S (T2 special)
= [-4.29*109, 4.29*109]D Range: [-1.0, 1.0] Primary Domain: [-Pi/2, Pi/2]= [-1.57, 1.57]
Exceptions
All arguments outside the domain generate an inexact.NaN, except [[formula]] Inf, which generates an undefined.NaN.
Propagated Error
A = X cos(X), R = X cot(X)
Generated error (See note 1)
Primary Domain [0, 2Pi] MRE RMSRE MAE RMSAE Single Length(Standard): 0.94 ulp 0.23 ulp 0.96 ulp 0.19 ulp Single Length(T2 special): 0.92 ulp 0.23 ulp 0.94 ulp 0.19 ulp Double Length: 0.90 ulp 0.22 ulp 0.91 ulp 0.18 ulp
The Algorithm
Notes1) For arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extra precision of step 2 is lost if N becomes too large, and the cut-off Smax is chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger than Smax a change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.
2) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function (outside the primary range), but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.
3) Since only the remainder of X by Pi is used in step 3, the symmetry
sin(x+ np ) = [[formula]] sin (x) is preserved, although there is a complication due to differing precision representations of p.4) The output range is not exceeded. Thus the output of SIN is always a valid argument for ASIN.
COS
DCOS
REAL32 FUNCTION COS (VAL REAL32 X)
REAL64 FUNCTION DCOS (VAL REAL64 X)
Compute cosine(X) (where X is in radians).
Domain: [-Cmax, Cmax] = [-205887.4, 205887.4]S (Standard),
= [-12868.0, 12868.0]S (T2 special)
= [-2.1*108, 2.1*108]D Range: [-1.0, 1.0] Primary Domain: See note 1.
Exceptions
All arguments outside the domain generate an inexact.NaN, except [[formula]]Inf, which generates an undefined.NaN.
Propagated Error
A = -X sin (X), R = -X tan (X) (See note 4)
Generated error
[0, Pi/4) [0, 2Pi] MRE RMSRE MAE RMSAE
Single Length(Standard): 0.93 ulp 0.25 ulp 0.88 ulp 0.18 ulp Single Length(T2 special): 1.1 ulp 0.3 ulp 0.94 ulp 0.19 ulp Double Length: 1.0 ulp 0.28 ulp 0.90 ulp 0.19 ulp
The Algorithm
Notes1) Inspection of the algorithm shows that argument reduction always occurs, thus there is no `primary domain' for COS. So for all arguments the accuracy of the result depends crucially on step 2. The standard single-precision version performs the argument reduction in double-precision, so there is effectively no loss of accuracy at this step. For the T2 special version and the double-precision version there are effectively K extra bits in the representation of p(K=8 for the former and 12 for the latter). If the argument agrees with an odd integer multiple of p/2 to more than k bits there is a loss of significant bits from the computed remainder equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.
2) The difference between COS evaluated at successive floating-point numbers is given approximately by the absolute error amplification factor, A. For arguments larger than Cmax this difference may be more than half the significant bits of the result, and so the result is considered to be essentially indeterminate and an inexact.NaN is returned. The extra precision of step 2 in the double-precision and T2 special versions is lost if N becomes too large, and the cut-off at Cmax prevents this also.
3) For small arguments the errors are not evenly distributed. As the argument becomes smaller there is an increasing bias towards negative errors (which is to be expected from the form of the Taylor series). For the single-length version and X in [-0.1, 0.1], 62% of the errors are negative, whilst for X in [-0.01, 0.01], 70% of them are.
4) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function, but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.
5) Since only the remainder of (|X|+Pi/2) by Pi is used in step 3, the symmetry cos(x+ np ) = [[formula]] cos(x) is preserved. Moreover, since the same rational approximation is used as in SIN, the relation cos(x) = sin(x+p/2) is also preserved. However, in each case there is a complication due to the different precision representations of p.
6) The output range is not exceeded. Thus the output of COS is always a valid argument for ACOS.
TAN
DTAN
REAL32 FUNCTION TAN (VAL REAL32 X)
REAL64 FUNCTION DTAN (VAL REAL64 X)
Compute tan(X) (where X is in radians).
Domain: [-Tmax, Tmax] = [-102943.7, 102943.7]S(Standard),
= [-2.1*106, 2.1*106]S(T2 special),
= [-2.1*109, 2.1*109]D Range: (-Inf, Inf) Primary Domain: [-Pi/4, Pi/4]= [-0.785, 0.785]
Exceptions
All arguments outside the domain generate an inexact.NaN, except [[formula]] Inf, which generate an undefined.NaN. Odd integer multiples of p/2 may produce unstable.NaN.
Propagated Error
A = X(1+ tan^(2)(X)), R = X(1+ tan^(2) (X))/tan(X) (See note 3)
Generated error
Primary Domain Error: MRE RMSRE (See note 3) Single Length(Standard): 1.44 ulp 0.39 ulp Single Length(T2 special): 1.37 ulp 0.39 ulp Double Length: 1.27 ulp 0.35 ulp
The Algorithm
1) R is large whenever X is near to an integer multiple of p/2, and so tan is very sensitive to small errors near its zeros and singularities. Thus for arguments outside the primary domain the accuracy of the result depends crucially on step 2, so this is performed with very high precision, using double precision Pi/2 for the standard single-precision function and two double-precision floating-point numbers for the representation of p /2 for the double-precision function. The T2 special version uses two single-precision floating-point numbers. The extra precision is lost if N becomes too large, and the cut-off Tmax is chosen to prevent this.2) The difference between TAN evaluated at successive floating-point numbers is given approximately by the absolute error amplification factor, A. For arguments larger than Smax this difference could be more than half the significant bits of the result, and so the result is considered to be essentially indeterminate and an inexact.NaN is returned.
3) Tan is quite badly behaved with respect to errors in the argument. Near its zeros outside the primary domain the relative error is greatly magnified, though the absolute error is only proportional to the size of the argument. In effect, the error is seriously amplified in an interval about each irrational zero, whose width increases roughly in proportion to the size of the argument. Near its singularities both absolute and relative errors become large, so any large output from this function is liable to be seriously contaminated with error, and the larger the argument, the smaller the maximum output which can be trusted. If step 3 of the algorithm requires division by zero, an unstable.NaN is produced instead.
4) Since only the remainder of X by Pi/2 is used in step 3, the symmetry tan(x+ np ) = tan(x) is preserved, although there is a complication due to the differing precision representations of p. Moreover, by step 3 the symmetry tan(x) = 1/tan( p/2 - x) is also preserved.
ASIN
DASIN
REAL32 FUNCTION ASIN (VAL REAL32 X)
REAL64 FUNCTION DASIN (VAL REAL64 X)
Compute sine^(-1)(X) (in radians).
Domain: [-1.0, 1.0] Range: [-Pi/2, Pi/2] Primary Domain: [-0.5, 0.5]
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = X/p1 - X^(2), R = X/(sin^(-1)(X) p1 - X^(2))
Generated Error
Primary Domain [-1.0, 1.0] MRE RMSRE MAE RMSAE
Single Length: 0.58 ulp 0.21 ulp 1.35 ulp 0.33 ulp Double Length: 0.59 ulp 0.21 ulp 1.26 ulp 0.27 ulp
The Algorithm
1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error: however since both domain and range are bounded the absolute error in the result cannot be large.2) By step 1, the identity sin^(-1)(x) = p/2 - 2 sin^(-1)(p(1-x)/2)) is preserved.
ACOS
DACOS
REAL32 FUNCTION ACOS (VAL REAL32 X)
REAL64 FUNCTION DACOS (VAL REAL64 X)
Compute cosine^(-1)(X) (in radians).
Domain: [-1.0, 1.0] Range: [0, Pi] Primary Domain: [-0.5, 0.5]
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = -X/p1 - X^(2), R = -X/(sin^(-1)(X) p1 - X^(2))
Generated Error
Primary Domain [-1.0, 1.0] MRE RMSRE MAE RMSAE Single Length: 1.06 ulp 0.38 ulp 2.37 ulp 0.61 ulp Double Length: 0.96 ulp 0.32 ulp 2.25 ulp 0.53 ulp
The Algorithm
1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error, although this interval is larger near 1 than near -1, since the function goes to zero with an infinite derivative there. However since both the domain and range are bounded the absolute error in the result cannot be large.2) Since the rational approximation is the same as that in ASIN, the relation cos^(-1)(x) = p/2 - sin^(-1)(x) is preserved.
ATAN
DATAN
REAL32 FUNCTION ATAN (VAL REAL32 X)
REAL64 FUNCTION DATAN (VAL REAL64 X)
Compute tan^(-1)(X) (in radians).
Domain: [-Inf, Inf] Range: [-Pi/2, Pi/2] Primary Domain: [-z, z], z = 2 - p3 = 0.2679
Exceptions
None.
Propagated Error
A = X/(1 + X^(2)), R = X/(tan^(-1)(X)(1 + X^(2)))
Generated Error
Primary Domain Error: MRE RMSRE Single Length: 0.56 ulp 0.21 ulp Double Length: 0.52 ulp 0.21 ulp
The Algorithm
1) For |X| > ATmax, |tan^(-1)(X)| is indistinguishable from p/2 in the floating-point format. For single-length, ATmax = 1.68*10^(7), and for double-length ATmax = 9*10^(15), approximately.2) This function is numerically very stable, despite the complicated argument reduction. The worst errors occur just above 2-p3, but are no more than 3.2 ulp.
3) It is also very well behaved with respect to errors in the argument, i.e. the error amplification factors are always small.
4) The argument reduction scheme ensures that the identities tan^(-1)(X) = p/2 - tan^(-1)(1/X), and tan^(-1)(X) = p/6 + tan^(-1)((p3*X-1)/(p3 + X)) are preserved.
ATAN2
DATAN2
REAL32 FUNCTION ATAN2 (VAL REAL32 X, Y)
REAL64 FUNCTION DATAN2 (VAL REAL64 X, Y)
Compute the angular co-ordinate tan^(-1)(Y/X) (in radians) of a point whose X and Y co-ordinates are given.
Domain: [-Inf, Inf] x [-Inf, Inf] Range: (-Pi, Pi] Primary Domain: See note 2.
Exceptions
(0, 0) and ([[formula]]Inf,[[formula]]Inf) give undefined.NaN.
Propagated Error
A = X(1 [[formula]] Y)/(X^(2) + Y^(2)), R= X(1 [[formula]] Y)/(tan^(-1)(Y/X)(X^(2) + Y^(2))) (See note 3)
Generated Error
See note 2.
The Algorithm
1) This two-argument function is designed to perform rectangular-to-polar co-ordinate conversion.2) See the notes for ATAN for the primary domain and estimates of the generated error.
3) The error amplification factors were derived on the assumption that the relative error in Y is [[formula]] that in X, otherwise there would be separate factors for X and Y. They are small except near the origin, where the polar co-ordinate system is singular.
SINH
DSINH
REAL32 FUNCTION SINH (VAL REAL32 X)
REAL64 FUNCTION DSINH (VAL REAL64 X)
Compute sinh(X).
Domain: [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D Range: (-Inf, Inf) Primary Domain: (-1.0, 1.0)
Exceptions
X < -Hmax gives -Inf, and X > Hmax gives Inf.
Propagated Error
A = Xcosh(X), R = Xcoth(X) (See note 3)
Generated Error
Primary Domain [1.0, XBig] (See note 2) MRE RMSRE MAE RMSAE Single Length: 0.91 ulp 0.26 ulp 1.41 ulp 0.34 ulp Double Length: 0.67 ulp 0.22 ulp 1.31 ulp 0.33 ulp
The Algorithm
1) Hmax is the point at which sinh(X) becomes too large to be represented in the floating-point format.2) XBig is the point at which e^(-)^(|X | )becomes insignificant compared with e^(|X |), (in floating-point). For single-length it is 8.32, and for double-length it is 18.37.
3) This function is quite stable with respect to errors in the argument. Relative error is magnified near zero, but the absolute error is a better measure near the zero of the function and it is diminished there. For large arguments absolute errors are magnified, but since the function is itself large, relative error is a better criterion, and relative errors are not magnified unduly for any argument in the domain, although the output does become less reliable near the ends of the range.
COSH
DCOSH
REAL32 FUNCTION COSH (VAL REAL32 X)
REAL64 FUNCTION DCOSH (VAL REAL64 X)
Compute cosh(X).
Domain: [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D Range: [1.0, Inf) Primary Domain: [-XBig, XBig]= [-8.32, 8.32]S
= [-18.37, 18.37]D
Exceptions
|X| > Hmax gives Inf.
Propagated Error
A = X sinh(X) , R = X tanh(X) (See note 3)
Generated Error
Primary Domain Error: MRE RMS Single Length: 1.24 ulp 0.32 ulp Double Length: 1.24 ulp 0.33 ulp
The Algorithm
1) Hmax is the point at which cosh(X) becomes too large to be represented in the floating-point format.2) XBig is the point at which e^(-)^(|X | )becomes insignificant compared with e^(|X | )(in floating-point).
3) Errors in the argument are not seriously magnified by this function, although the output does become less reliable near the ends of the range.
TANH
DTANH
REAL32 FUNCTION TANH (VAL REAL32 X)
REAL64 FUNCTION DTANH (VAL REAL64 X)
Compute tanh(X).
Domain: [-Inf, Inf] Range: [-1.0, 1.0] Primary Domain: [-Log(3)/2, Log(3)/2] = [-0.549, 0.549]
Exceptions
None.
Propagated Error
A = X/cosh^(2)(X), R = X/sinh(X) cosh(X)
Generated Error
Primary Domain Error: MRE RMS Single Length: 0.53 ulp 0.2 ulp Double Length: 0.53 ulp 0.2 ulp
The Algorithm
1) As a floating-point number, tanh(X) becomes indistinguishable from its asymptotic values of [[formula]]1.0 for |X|> HTmax, where HTmax is 8.4 for single-length, and 19.06 for double-length. Thus the output of TANH is equal to [[formula]]1.0 for such X.2) This function is very stable and well-behaved, and errors in the argument are always diminished by it.
RAN
DRAN
REAL32,INT32 FUNCTION RAN (VAL INT32 X)
REAL64,INT64 FUNCTION DRAN (VAL INT64 X)
These produce a pseudo-random sequence of integers, or a corresponding sequence of floating-point numbers between zero and one. X is the seed integer that initiates the sequence.
Domain: Integers (see note 1) Range: [0.0, 1.0] x Integers
Exceptions
None.
The Algorithm
1) This function has two results, the first a real, and the second an integer (both 32 bits for single-length, and 64 bits for double-length). The integer is used as the argument for the next call to RAN, i.e. it `carries' the pseudo-random linear congruential sequence Nk, and it should be kept in scope for as long as RAN is used. It should be initialized before the first call to RAN but not modified thereafter except by the function itself.2) If the integer parameter is initialized to the same value, the same sequence (both floating-point and integer) will be produced. If a different sequence is required for each run of a program it should be initialized to some `random' value, such as the output of a timer.
3) The integer parameter can be copied to another variable or used in expressions requiring random integers. The topmost bits are the most random. A random integer in the range [0,L] can conveniently be produced by taking the remainder by (L+1) of the integer parameter shifted right by one bit. If the shift is not done an integer in the range [-L,L] will be produced.
4) The modulus M is 2^(32) for single-length and 2^(64) for double-length, and the multipliers, a, have been chosen so that all M integers will be produced before the sequence repeats. However several different integers can produce the same floating-point value and so a floating-point output may be repeated, although the sequence of such will not be repeated until M calls have been made.
5) The floating-point result is uniformly distributed over the output range, and the sequence passes various tests of randomness, such as the `run test', the `maximum of 5 test' and the `spectral test'.
6) The double-length version is slower to execute, but `more random' than the single-length version. If a highly-random sequence of single-length numbers is required, this could be produced by converting the output of DRAN to single-length. Conversely if only a relatively crude sequence of double-length numbers is required, RAN could be used for higher speed and its output converted to double-length.
#USE "tbmaths.lib"
The version of the library described by this section has been written for 32-bit processors without hardware for floating-point arithmetic. Functions from it will give results very close, but not identical to, those produced by the corresponding functions from the single and double length libraries.
This is the version specifically intended to derive maximum performance from the IMS T400, T414, T425, and T426 processors. The single-precision functions make use of the FMUL instruction available on 32-bit processors without floating-point hardware. The library is compiled for transputer class TB.
The tables and notes at the beginning of section 1.4 apply equally here. However all the functions are contained in one library.
REAL32 FUNCTION ALOG (VAL REAL32 X)
REAL64 FUNCTION DALOG (VAL REAL64 X)
These compute: loge(X)
Domain: (0, XMAX] Range: [MinLog, MaxLog] (See note 2) Primary Domain: [p2/2, p2) = [0.7071, 1.4142)
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A 51, R = loge(X)
Generated Error
Primary Domain Error: MRE RMSRE Single Length: 1.19 ulp 0.36 ulp Double Length: 2.4 ulp 1.0 ulp
The Algorithm
1) The term ln(2) * N is much easier to compute (and more accurate) than LnF, and it is larger provided N is not 0 (i.e. for arguments outside the primary domain). Thus the accuracy of the result improves as the modulus of log(X) increases.2) The minimum value that can be produced, MinLog, is the logarithm of the smallest denormalized floating-point number. For single length MinLog is -103.28, and for double length it is -744.4. The maximum value MaxLog is the logarithm of XMAX. For single-length it is 88.72, and for double-length it is 709.78.
3) Since Inf is used to represent all values greater than XMAX its logarithm cannot be defined.
4) This function is well-behaved and does not seriously magnify errors in the argument.
ALOG10
DALOG10
REAL32 FUNCTION ALOG10 (VAL REAL32 X)
REAL64 FUNCTION DALOG10 (VAL REAL64 X)
These compute: log10(X)
Domain: (0, XMAX] Range: [MinL10, MaxL10] (See note 2) Primary Domain: [p2/2, p2) = [0.7071, 1.4142)
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A 5 log10(e), R = log10(e)/loge(X)
Generated Error
Primary Domain Error: MRE RMSRE Single Length: 1.43 ulp 0.39 ulp Double Length: 2.64 ulp 0.96 ulp
The Algorithm
1) See note 1 for ALOG.2) The minimum value that can be produced, MinL10, is the base 10 logarithm of the smallest denormalized floating-point number. For single length MinL10 is -44.85, and for double length it is -323.3. The maximum value MaxL10 is the base 10 logarithm of XMAX. For single length MaxL10 is 38.53, and for double-length it is 308.26.
3) Since Inf is used to represent all values greater than XMAX its logarithm cannot be defined.
4) This function is well-behaved and does not seriously magnify errors in the argument.
EXP
DEXP
REAL32 FUNCTION EXP (VAL REAL32 X)
REAL64 FUNCTION DEXP (VAL REAL64 X)
These compute: e^(X)
Domain: [-Inf, MaxLog) = [-Inf, 88.03)S, [-Inf, 709.78)D Range: [0, Inf) (See note 4) Primary Domain: [-Ln2/2, Ln2/2) = [-0.3466, 0.3466)
Exceptions
All arguments outside the domain generate an Inf.
Propagated Error
A = Xe^(X), R = X
Generated Error
Primary Domain Error: MRE RMSRE Single Length: 0.51 ulp 0.21 ulp Double Length: 0.5 ulp 0.21 ulp
The Algorithm
1) MaxLog is loge(XMAX).2) The analytical properties of e^(x) make the relative error of the result proportional to the absolute error of the argument. Thus the accuracy of step 2, which prepares the argument for the rational approximation, is crucial to the performance of the subroutine. It is completely accurate when N = 0, i.e. in the primary domain, and becomes less accurate as the magnitude of N increases. Since N can attain larger negative values than positive ones, EXP is least accurate for large, negative arguments.
3) For sufficiently negative arguments (below -87.34 for single-length and below -708.4 for double-length) the output is denormalized, and so the floating-point number contains progressively fewer significant digits, which degrades the accuracy. In such cases the error can theoretically be a factor of two.
4) Although the true exponential function is never zero, for large negative arguments the true result becomes too small to be represented as a floating-point number, and EXP underflows to zero. This occurs for arguments below -103.9 for single-length, and below -745.2 for double-length.
5) The propagated error is considerably magnified for large positive arguments, but diminished for large negative arguments.
POWER
DPOWER
REAL32 FUNCTION POWER (VAL REAL32 X, Y)
REAL32 FUNCTION DPOWER (VAL REAL64 X, Y)
These compute: X^(Y)
Domain: [0, Inf] x [-Inf, Inf] Range: (-Inf, Inf) Primary Domain: See note 3.
Exceptions
If the first argument is outside its domain, undefined.NaN is returned. If the true value of X^(Y) exceeds XMAX, Inf is returned. In certain other cases other NaNs are produced: See note 2.
Propagated Error
A = YX^(Y)(1 [[formula]] loge(X)), R = Y(1 [[formula]] loge(X)) (See note 4)
Generated Error
Example Range Error: MRE RMSRE (See note 3) Single Length: 1.0 ulp 0.24 ulp Double Length: 13.2 ulp 1.73 ulp
The Algorithm
Deal with special cases: either argument = 1, 0, +Inf or -Inf (see note 2). Otherwise:(a) For single precision:
(b) For double precision:
1) This subroutine implements the mathematical function x^(y) to a much greater accuracy than can be attained using the ALOG and EXP functions, by performing each step in higher precision.2) Results for special cases are as follows:
First Input (X) Second Input (Y) Result t0 ANY undefined.NaN 0 v0 undefined.NaN 0 0 <Y<=XMAX 0 0 Inf unstable.NaN 0 t X t1 Inf 0 0 t X t1 -Inf Inf 1 -XMAX<=Y<=XMAX 1 1 " Inf unstable.NaN 1 t X v XMAX Inf Inf 1 t X v XMAX -Inf 0 Inf 1 v Y v Inf Inf Inf -Inf vY v -1 0 Inf -1 t Y t1 undefined.NaN otherwise 0 1 otherwise 1 X
3) Performing all the calculations in extended precision makes the double-precision algorithm very complex in detail, and having two arguments makes a primary domain difficult to specify. As an indication of accuracy, the functions were evaluated at 100,000 points logarithmically distributed over (0.1, 10.0), with the exponent linearly distributed over (-35.0, 35.0) (single-length), and (-300.0, 300.0) (double-length), producing the errors given above. The errors are much smaller if the exponent range is reduced.4) The error amplification factors are calculated on the assumption that the relative error in Y is [[formula]] that in X, otherwise there would be separate factors for both X and Y. It can be seen that the propagated error will be greatly amplified whenever log e(X) or Y is large.
The Algorithm
REAL32 FUNCTION SIN (VAL REAL32 X)
REAL64 FUNCTION DSIN (VAL REAL64 X)
These compute: sine(X) (where X is in radians)
Domain: [-Smax, Smax] = [-12868.0, 12868.0]S,
= [-2.1*108, 2.1*108]D Range: [-1.0, 1.0] Primary Domain: [-Pi/2, Pi/2] = [-1.57, 1.57]
Exceptions
All arguments outside the domain generate an inexact.NaN, except [[formula]] Inf, which generates an undefined.NaN.Propagated Error
A = X cos(X), R = X cot(X)
Generated Error (See note 3)
Range: Primary Domain [0, 2Pi] MRE RMSRE MAE RMSAE Single Length: 0.65 ulp 0.22 ulp 0.74 ulp 0.18 ulp Double Length: 0.56 ulp 0.21 ulp 0.64 ulp 0.16 ulp
The Algorithm
1) For arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extended precision corresponds to K extra bits in the representation of p (K = 8 for single-length and 12 for double-length). If the argument agrees with an integer multiple of p to more than K bits there is a loss of significant bits in the remainder, equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.2) The extra precision of step 2 is lost if N becomes too large, and the cut-off Smax is chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger than Smax a change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.
3) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function (outside the primary range), but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.
4) Since only the remainder of X by Pi is used in step 3, the symmetry sin(x+ np ) = [[formula]] sin(x) is preserved, although there is a complication due to differing precision representations of p.
5) The output range is not exceeded. Thus the output of SIN is always a valid argument for ASIN.
COS
DCOS
REAL32 FUNCTION COS (VAL REAL32 X)
REAL64 FUNCTION DCOS (VAL REAL64 X)
These compute: cosine (X) (where X is in radians)
Domain: [-Smax, Smax] = [-12868.0, 12868.0]S,
= [-2.1*108, 2.1*108]D Range: [-1.0, 1.0] Primary Domain: See note 1.
Exceptions
All arguments outside the domain generate an inexact.NaN, except [[formula]]Inf, which generates an undefined.NaN.
Propagated Error
A = -X sin(X), R = -X tan(X) (See note 4)
Generated Error
Range: [0,Pi/4) [0, 2Pi] MRE RMSRE MAE RMSAE Single Length: 1.0 ulp 0.28 ulp 0.81 ulp 0.17 ulp Double Length: 0.93 ulp 0.26 ulp 0.76 ulp 0.18 ulp
The Algorithm
1) Inspection of the algorithm shows that argument reduction always occurs, thus there is no `primary domain' for COS. So for all arguments the accuracy of the result depends crucially on step 2. The extended precision corresponds to K extra bits in the representation of p (K = 8 for single-length and 12 for double length). If the argument agrees with an odd integer multiple of p/2 to more than K bits there is a loss of significant bits in the remainder, equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.2) The extra precision of step 2 is lost if N becomes too large, and the cut-off Smax is chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger than Smax a change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.
3) For small arguments the errors are not evenly distributed. As the argument becomes smaller there is an increasing bias towards negative errors (which is to be expected from the form of the Taylor series). For the single-length version and X in [-0.1, 0.1], 62% of the errors are negative, whilst for X in [-0.01, 0.01], 70% of them are.
4) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function, but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.
5) Since only the remainder of (|X|+Pi/2) by Pi is used in step 3, the symmetry cos(x+ np ) = [[formula]] cos(x) is preserved. Moreover, since the same rational approximation is used as in SIN, the relation cos(x) =sin(x+ p/2) is also preserved. However, in each case there is a complication due to the different precision representations of p.
6) The output range is not exceeded. Thus the output of COS is always a valid argument for ACOS.
TAN
DTAN
REAL32 FUNCTION TAN (VAL REAL32 X)
REAL64 FUNCTION DTAN (VAL REAL64 X)
These compute: tan(X) (where X is in radians)
Domain: [-Tmax, Tmax] = [-6434.0, 6434.0]S = [-1.05*108, 1.05*108]D Range: (-Inf, Inf) Primary Domain: [-Pi/4, Pi/4] = [-0.785, 0.785]
Exceptions
All arguments outside the domain generate an inexact.NaN, except [[formula]] Inf, which generate an undefined.NaN. Odd integer multiples of p/2 may produce unstable.NaN.
Propagated Error
A = X(1+ tan^(2)(X)), R = X(1+ tan^(2)(X))/tan(X) (See note 4)
Generated Error
Primary Domain Error: MRE RMSRE Single Length: 3.5 ulp 0.23 ulp Double Length: 0.69 ulp 0.23 ulp
The Algorithm
1) R is large whenever X is near to an integer multiple of p/2, and so tan is very sensitive to small errors near its zeros and singularities. Thus for arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extended precision corresponds to K extra bits in the representation of p/2 (K = 8 for single-length and 12 for double-length). If the argument agrees with an integer multiple of p/2 to more than K bits there is a loss of significant bits in the remainder, approximately equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.2) The extra precision of step 2 is lost if N becomes too large, and the cut-off Tmax is chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger than Tmax a change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.
3) Step 3 of the algorithm has been slightly modified in the double-precision version from that given in Cody & Waite to avoid fixed-point underflow in the polynomial evaluation for small arguments.
4) Tan is quite badly behaved with respect to errors in the argument. Near its zeros outside the primary domain the relative error is greatly magnified, though the absolute error is only proportional to the size of the argument. In effect, the error is seriously amplified in an interval about each irrational zero, whose width increases roughly in proportion to the size of the argument. Near its singularities both absolute and relative errors become large, so any large output from this function is liable to be seriously contaminated with error, and the larger the argument, the smaller the maximum output which can be trusted. If step 4 of the algorithm requires division by zero, an unstable.NaN is produced instead.
5) Since only the remainder of X by Pi/2 is used in step 3, the symmetry tan(x+ np) = tan(x) is preserved, although there is a complication due to the differing precision representations of p. Moreover, by step 4 the symmetry tan(x) = 1/ tan( p/2 - x) is also preserved.
ASIN
DASIN
REAL32 FUNCTION ASIN (VAL REAL32 X)
REAL64 FUNCTION DASIN (VAL REAL64 X)
These compute: sin^(-1)(X) (in radians)
Domain: [-1.0, 1.0] Range: [-Pi/2, Pi/2] Primary Domain: [-0.5, 0.5]
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = X/p1 - X^(2), R = X/(sin^(-1)(X) p1 - X^(2))
Generated Error
Primary Domain [-1.0, 1.0] MRE RMSRE MAE RMSAE Single Length: 0.53 ulp 0.21 ulp 1.35 ulp 0.33 ulp Double Length: 2.8 ulp 1.4 ulp 2.34 ulp 0.64 ulp
The Algorithm
1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error: however since both domain and range are bounded the absolute error in the result cannot be large.2) By step 1, the identity sin^(-1)(x) = p/2 - 2 sin^(-1)(p(1-x)/2)) is preserved.
ACOS
DACOS
REAL32 FUNCTION ACOS (VAL REAL32 X)
REAL64 FUNCTION DACOS (VAL REAL64 X)
These compute: cosine^(-1)(X) (in radians)
Domain: [-1.0, 1.0] Range: [0, Pi] Primary Domain: [-0.5, 0.5]
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = -X/p1 - X^(2), R = -X/(sin^(-1)(X) p1 - X^(2))
Generated Error
Primary Domain [-1.0, 1.0] MRE RMSRE MAE RMSAE Single Length: 1.1 ulp 0.38 ulp 2.4 ulp 0.61 ulp Double Length: 1.3 ulp 0.34 ulp 2.9 ulp 0.78 ulp
The Algorithm
1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error, although this interval is larger near 1 than near -1, since the function goes to zero with an infinite derivative there. However since both the domain and range are bounded the absolute error in the result cannot be large.2) Since the rational approximation is the same as that in ASIN, the relation cos^(-1)(x) = p/2 - sin^(-1)(x)is preserved.
ATAN
DATAN
REAL32 FUNCTION ATAN (VAL REAL32 X)
REAL64 FUNCTION DATAN (VAL REAL64 X)
These compute: tan^(-1)(X) (in radians)
Domain: [-Inf, Inf] Range: [-Pi/2, Pi/2] Primary Domain: [-z, z], z = 2 - p3 = 0.2679
Exceptions
None.
Propagated Error
A = X/(1 + X^(2)), R = X/(tan^(-1)(X)(1 + X^(2)))
Generated Error
Primary Domain Error: MRE RMSRE Single Length: 0.53 ulp 0.21 ulp Double Length: 1.27 ulp 0.52 ulp
The Algorithm
1) For |X| > ATmax, |tan^(-1)(X)| is indistinguishable from p/2 in the floating-point format. For single-length, ATmax = 1.68*10^(7), and for double-length ATmax = 9*10^(15), approximately.2) This function is numerically very stable, despite the complicated argument reduction. The worst errors occur just above 2-p3, but are no more than 1.8 ulp. 3) It is also very well behaved with respect to errors in the argument, i.e. the error amplification factors are always small.
4) The argument reduction scheme ensures that the identities tan^(-1)(X) = p/2 - tan^(-1)(1/X), and tan^(-1)(X) = p/6 + tan^(-1)((p3*X-1)/(p3 + X)) are preserved.
ATAN2
DATAN2
REAL32 FUNCTION ATAN2 (VAL REAL32 X, Y)
REAL64 FUNCTION DATAN2 (VAL REAL64 X, Y)
These compute the angular co-ordinate tan^(-1)(Y/X) (in radians) of a point whose X and Y co-ordinates are given.
Domain: [-Inf, Inf] x [-Inf, Inf] Range: (-Pi, Pi] Primary Domain: See note 2.
Exceptions
(0, 0) and ([[formula]]Inf,[[formula]]Inf) give undefined.NaN.
Propagated Error
A = X(1 [[formula]] Y)/(X^(2) + Y^(2)), R = X(1 [[formula]] Y)/(tan^(-1)(Y/X)(X^(2) + Y^(2))) (See note 3)
Generated Error
See note 2.
The Algorithm
1) This two-argument function is designed to perform rectangular-to-polar co-ordinate conversion.2) See the notes for ATAN for the primary domain and estimates of the generated error.
3) The error amplification factors were derived on the assumption that the relative error in Y is [[formula]] that in X, otherwise there would be separate factors for X and Y. They are small except near the origin, where the polar co-ordinate system is singular.
SINH
DSINH
REAL32 FUNCTION SINH (VAL REAL32 X)
REAL64 FUNCTION DSINH (VAL REAL64 X)
These compute: sinh(X)
Domain:[-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D Range: (-Inf, Inf) Primary Domain: (-1.0, 1.0)
Exceptions
X < -Hmax gives -Inf, and X > Hmax gives Inf.
Propagated Error
A = X cosh(X), R = X coth(X) (See note 3)
Generated Error
Primary Domain [1.0, XBig]
(See note 2) MRE RMSRE MAE RMSAE Single Length: 0.89 ulp 0.3 ulp 0.98 ulp 0.31 ulp Double Length: 1.3 ulp 0.51 ulp 1.0 ulp 0.3 ulp
The Algorithm
1) Hmax is the point at which sinh(X) becomes too large to be represented in the floating-point format.2) XBig is the point at which e^(-)^(|X | )becomes insignificant compared with e^(|X |), (in floating-point). For single-length it is 8.32, and for double-length it is 18.37.
3) This function is quite stable with respect to errors in the argument. Relative error is magnified near zero, but the absolute error is a better measure near the zero of the function and it is diminished there. For large arguments absolute errors are magnified, but since the function is itself large, relative error is a better criterion, and relative errors are not magnified unduly for any argument in the domain, although the output does become less reliable near the ends of the range.
COSH
DCOSH
REAL32 FUNCTION COSH (VAL REAL32 X)
REAL64 FUNCTION DCOSH (VAL REAL64 X)
These compute: cosh(X)
Domain: [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D Range: [1.0, Inf) PrimaryDomain: [-XBig, XBig] = [-8.32, 8.32]S
= [-18.37, 18.37]D
Exceptions
|X| > Hmax gives Inf.
Propagated Error
A = X sinh(X) , R = X tanh(X) (See note 3)
Generated Error
Primary Domain Error: MRE RMS Single Length: 0.99 ulp 0.3 ulp Double Length: 1.23 ulp 0.3 ulp
The Algorithm
1) Hmax is the point at which cosh(X) becomes too large to be represented in the floating-point format.2) XBig is the point at which e^(-)^(|X | )becomes insignificant compared with e^(|X |) (in floating-point).
3) Errors in the argument are not seriously magnified by this function, although the output does become less reliable near the ends of the range.
TANH
DTANH
REAL32 FUNCTION TANH (VAL REAL32 X)
REAL64 FUNCTION DTANH (VAL REAL64 X)
These compute: tanh(X)
Domain: [-Inf, Inf] Range: [-1.0, 1.0] Primary Domain: [-Log(3)/2, Log(3)/2] = [-0.549, 0.549]
Exceptions
None.
Propagated Error
A = X/cosh^(2)(X), R = X/sinh (X) cosh(X)
Generated Error
Primary Domain Error: MRE RMS Single Length: 0.52 ulp 0.2 ulp Double Length: 4.6 ulp 2.6 ulp
The Algorithm
1) As a floating-point number, tanh(X) becomes indistinguishable from its asymptotic values of [[formula]]1.0 for |X| > HTmax, where HTmax is 8.4 for single-length, and 19.06 for double-length. Thus the output of TANH is equal to [[formula]]1.0 for such X.2) This function is very stable and well-behaved, and errors in the argument are always diminished by it.
RAN
DRAN
REAL32,INT32 FUNCTION RAN (VAL INT32 X)
REAL64,INT64 FUNCTION DRAN (VAL INT64 X)
These produce a pseudo-random sequence of integers, and a corresponding sequence of floating-point numbers between zero and one.
Domain: Integers (see note 1) Range: [0.0, 1.0] x Integers
Exceptions
None.
The Algorithm
1) This function has two results, the first a real, and the second an integer (both 32 bits for single-length, and 64 bits for double-length). The integer is used as the argument for the next call to RAN, i.e. it `carries' the pseudo-random linear congruential sequence Nk, and it should be kept in scope for as long as RAN is used. It should be initialized before the first call to RAN but not modified thereafter except by the function itself.2) If the integer parameter is initialized to the same value, the same sequence (both floating-point and integer) will be produced. If a different sequence is required for each run of a program it should be initialized to some `random' value, such as the output of a timer.
3) The integer parameter can be copied to another variable or used in expressions requiring random integers. The topmost bits are the most random. A random integer in the range [0,L] can conveniently be produced by taking the remainder by (L+1) of the integer parameter shifted right by one bit. If the shift is not done an integer in the range [-L,L] will be produced.
4) The modulus M is 2^(32) for single-length and 2^(64) for double-length, and the multipliers, a, have been chosen so that all M integers will be produced before the sequence repeats. However several different integers can produce the same floating-point value and so a floating-point output may be repeated, although the sequence of such will not be repeated until M calls have been made.
5) The floating-point result is uniformly distributed over the output range, and the sequence passes various tests of randomness, such as the `run test', the `maximum of 5 test' and the `spectral test'.
6) The double-length version is slower to execute, but `more random' than the single-length version. If a highly-random sequence of single-length numbers is required, this could be produced by converting the output of DRAN to single-length. Conversely if only a relatively crude sequence of double-length numbers is required, RAN could be used for higher speed and its output converted to double-length.