WARNING: This has been converted from RTF format and may have mistakes compared to the printed version. If you find any in this file, please send them to me, and I will update it.
Dave Beckett <D.J.Beckett@ukc.ac.uk>
Elementary maths and trigonometric functions are provided in three libraries, as follows:
Library        Description
snglmath.lib   Single length library
dblmath.lib    Double length library
tbmaths.lib    TB optimized library
The single and double length libraries contain the same set of maths functions in single and double length forms. The double length forms all begin with the letter `D'. All function names are in upper case.
The TB optimized library is a combined single and double length library containing functions for the T4 series (T400, T414, T425, and T426). The functions have been optimized for speed. The standard single or double length libraries can be used on T4 processors but optimum performance will be achieved by using the TB optimized library. The accuracy of the T400/T414/T425/T426 optimized functions is similar to that of the standard single length functions but results returned may not be identical because different algorithms are used. If the optimized library is used in code compiled for any processor except a T400, T414, T425, or T426, the compiler reports an error.
To obtain the best possible speed performance with the occam maths functions use the following strategy:
The elementary function library is also described in appendix N of the occam 2 Reference manual.
These function subroutines have been written to be compatible with the ANSI standard for binary floating-point arithmetic (ANSI/IEEE Std 754-1985), as implemented in occam. They are based on the algorithms in:
Cody, W. J., and Waite, W. M. [1980]. Software Manual for the Elementary Functions. Prentice-Hall, New Jersey.
The only exceptions are the pseudo-random number generators, which are based on algorithms in:
Knuth, D. E. [1981]. The Art of Computer Programming, 2nd. edition, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, Mass.
Error conditions are reported by means of three distinct NaNs:
The implementations will return the following values for these Not-a-Numbers:
Error           Single length value   Double length value
undefined.NaN   #7F800010             #7FF00002 00000000
unstable.NaN    #7F800008             #7FF00001 00000000
inexact.NaN     #7F800004             #7FF00000 80000000
For most of the functions the quoted error is for arguments in the primary domain, which represents the basic accuracy of the approximation. For some functions the process of range reduction results in a higher accuracy for arguments outside the primary domain, and for others it does the reverse. Refer to the notes on each function for more details.
If f is the mathematical function and F the subroutine approximation, then the relative error at the floating-point number X (provided f(X) is not zero) is:

  e(X) = (F(X) - f(X)) / f(X)
Obviously the relative error may become very large near a zero of f(X). If the zero is at an irrational argument (which cannot be represented as a floating-point value), the absolute error is a better measure of the accuracy of the function near the zero.
As it is impractical to find the relative error for every possible argument, statistical measures of the overall error must be used. If the relative error is sampled at a number of points Xn (n = 1 to N), then useful statistics are the maximum relative error and the root-mean-square relative error:

  MRE = max |e(Xn)|,   RMSRE = sqrt((1/N) * sum e(Xn)^2)
Corresponding statistics can be formed for the absolute error also, and are called MAE and RMSAE respectively.
The MRE generally occurs near a zero of the function, especially if the true zero is irrational, or near a singularity where the result is large, since the `granularity' of the floating-point numbers then becomes significant.
A useful unit of relative error is the relative magnitude of the least significant bit in the floating-point fraction, which is called one `unit in the last place' (ulp), i.e. the smallest e such that 1 + e <> 1. Its magnitude depends on the floating-point format: for single-length it is 2^(-23) = 1.19*10^(-7), and for double-length it is 2^(-52) = 2.22*10^(-16).
If the relative error in the argument X is d (typically a few ulp), then the absolute error (E) and relative error (e) in f(X) are:

  E = A * d,   e = R * d,   where A = X * f'(X) and R = X * f'(X)/f(X)
This defines the absolute and relative error magnification factors A and R. When both are large the function is unstable, i.e. even a small error in the argument, such as would be produced by evaluating a floating-point expression, will cause a large error in the value of the function. In such cases the functions return an unstable.NaN, which is simple to detect.
The functional forms of both A and R are given in the specification of each function.
In both cases the higher_precision implementation was used to approximate the mathematical function (called f above) in the computation of the error, which was evaluated in the higher precision to avoid rounding errors. Error statistics were produced according to the formulae above.
All single length functions except RAN return a single result of type REAL32, and all except RAN, POWER and ATAN2 have one parameter, a VAL REAL32.
POWER and ATAN2 have two parameters which are VAL REAL32s for the two arguments of each function.
RAN returns two results, of types REAL32 and INT32, and has one parameter which is a VAL INT32.
In each case the double-length version of name is called Dname, returns a REAL64 (except DRAN, which returns REAL64, INT64), and has parameters of type VAL REAL64 (VAL INT64 for DRAN).
For the standard single-length format it is 2^(-23) = 1.19*10^(-7).
For the double-length format it is 2^(-52) = 2.22*10^(-16).
This is also used as a measure of absolute error, since such errors can be considered `relative' to unity.
Where the range depends on the floating-point format, single-length is indicated with an S and double-length with a D.
For functions with two arguments the complete range of both arguments is given. This means that for each number in one range, there is at least one (though sometimes only one) number in the other range such that the pair of arguments is valid. Both ranges are shown, linked by an `x'.
Pi means the closest floating-point representation of the transcendental number pi, ln(2) the closest representation of loge(2), and so on.
In describing the algorithms, `X' is used generically to designate the argument, and `result' (or RESULT, in the style of occam functions) to designate the output.
These two libraries will be efficient on processors with fast floating-point arithmetic and good support for the floating-point predefined functions such as MULBY2 and ARGUMENT.REDUCE. For 32-bit processors without special hardware for floating-point calculations the alternative optimized library described in section 1.4.3, using fixed-point arithmetic, will be faster, but will not give identical results.
A special version has been produced for 16-bit transputers, which avoids the use of any double-precision arithmetic in the single precision functions. This is distinguished in the notes by the annotation `T2 special'; notes relating to the version for T8 and TB are denoted by `standard'.
Single and double length maths functions are listed below. Descriptions of the functions can be found in succeeding sections.
To use the single length library a program header must include the line
#USE "snglmath.lib"
To use the double length library a program header must include the line
#USE "dblmath.lib"
Result(s)      Function   Parameter specifiers
REAL32         ALOG       VAL REAL32 X
REAL32         ALOG10     VAL REAL32 X
REAL32         EXP        VAL REAL32 X
REAL32         POWER      VAL REAL32 X, VAL REAL32 Y
REAL32         SIN        VAL REAL32 X
REAL32         COS        VAL REAL32 X
REAL32         TAN        VAL REAL32 X
REAL32         ASIN       VAL REAL32 X
REAL32         ACOS       VAL REAL32 X
REAL32         ATAN       VAL REAL32 X
REAL32         ATAN2      VAL REAL32 X, VAL REAL32 Y
REAL32         SINH       VAL REAL32 X
REAL32         COSH       VAL REAL32 X
REAL32         TANH       VAL REAL32 X
REAL32,INT32   RAN        VAL INT32 X
REAL64         DALOG      VAL REAL64 X
REAL64         DALOG10    VAL REAL64 X
REAL64         DEXP       VAL REAL64 X
REAL64         DPOWER     VAL REAL64 X, VAL REAL64 Y
REAL64         DSIN       VAL REAL64 X
REAL64         DCOS       VAL REAL64 X
REAL64         DTAN       VAL REAL64 X
REAL64         DASIN      VAL REAL64 X
REAL64         DACOS      VAL REAL64 X
REAL64         DATAN      VAL REAL64 X
REAL64         DATAN2     VAL REAL64 X, VAL REAL64 Y
REAL64         DSINH      VAL REAL64 X
REAL64         DCOSH      VAL REAL64 X
REAL64         DTANH      VAL REAL64 X
REAL64,INT64   DRAN       VAL INT64 X
REAL32 FUNCTION ALOG (VAL REAL32 X)
REAL64 FUNCTION DALOG (VAL REAL64 X)
Compute loge(X).
Domain:         (0, XMAX]
Range:          [MinLog, MaxLog] (See note 2)
Primary Domain: [sqrt(2)/2, sqrt(2)) = [0.7071, 1.4142)
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = 1, R = 1/loge(X)
Generated Error
Primary Domain Error:        MRE      RMSRE
Single Length (Standard):    1.7 ulp  0.43 ulp
Single Length (T2 special):  1.6 ulp  0.42 ulp
Double Length:               1.4 ulp  0.38 ulp
The Algorithm
1) The term ln(2) * N is much easier to compute (and more accurate) than LnF, and it is larger provided N is not 0 (i.e. for arguments outside the primary domain). Thus the accuracy of the result improves as the modulus of loge(X) increases.

2) The minimum value that can be produced, MinLog, is the logarithm of the smallest denormalized floating-point number. For single length MinLog is -103.28, and for double length it is -744.4. The maximum value MaxLog is the logarithm of XMAX. For single-length it is 88.72, and for double-length it is 709.78.
3) Since Inf is used to represent all values greater than XMAX its logarithm cannot be defined.
4) This function is well-behaved and does not seriously magnify errors in the argument.
ALOG10
DALOG10
REAL32 FUNCTION ALOG10 (VAL REAL32 X)
REAL64 FUNCTION DALOG10 (VAL REAL64 X)
Compute log10(X).
Domain:         (0, XMAX]
Range:          [MinL10, MaxL10] (See note 2)
Primary Domain: [sqrt(2)/2, sqrt(2)) = [0.7071, 1.4142)
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = log10(e), R = log10(e)/loge(X)
Generated Error
Primary Domain Error:        MRE       RMSRE
Single Length (Standard):    1.70 ulp  0.45 ulp
Single Length (T2 special):  1.71 ulp  0.46 ulp
Double Length:               1.84 ulp  0.45 ulp
The Algorithm
1) See note 1 for ALOG.

2) The minimum value that can be produced, MinL10, is the base-10 logarithm of the smallest denormalized floating-point number. For single length MinL10 is -44.85, and for double length it is -323.3. The maximum value MaxL10 is the base-10 logarithm of XMAX. For single length MaxL10 is 38.53, and for double-length it is 308.26.
3) Since Inf is used to represent all values greater than XMAX its logarithm cannot be defined.
4) This function is well-behaved and does not seriously magnify errors in the argument.
EXP
DEXP
REAL32 FUNCTION EXP (VAL REAL32 X)
REAL64 FUNCTION DEXP (VAL REAL64 X)
Compute e^(X).
Domain:         [-Inf, MaxLog) = [-Inf, 88.72)S, [-Inf, 709.78)D
Range:          [0, Inf) (See note 4)
Primary Domain: [-Ln2/2, Ln2/2) = [-0.3466, 0.3466)
Exceptions
All arguments outside the domain generate an Inf.
Propagated error
A = Xe^(X), R = X
Generated error
Primary Domain Error:        MRE       RMSRE
Single Length (Standard):    0.99 ulp  0.25 ulp
Single Length (T2 special):  1.0 ulp   0.25 ulp
Double Length:               1.4 ulp   0.25 ulp
The Algorithm
1) MaxLog is loge(XMAX).

2) For sufficiently negative arguments (below -87.34 for single-length and below -708.4 for double-length) the output is denormalized, and so the floating-point number contains progressively fewer significant digits, which degrades the accuracy. In such cases the error can theoretically be a factor of two.
3) Although the true exponential function is never zero, for large negative arguments the true result becomes too small to be represented as a floating-point number, and EXP underflows to zero. This occurs for arguments below -103.9 for single-length, and below -745.2 for double-length.
4) The propagated error is considerably magnified for large positive arguments, but diminished for large negative arguments.
POWER
DPOWER
REAL32 FUNCTION POWER (VAL REAL32 X, Y)
REAL64 FUNCTION DPOWER (VAL REAL64 X, Y)
Compute X^(Y).
Domain:         [0, Inf] x [-Inf, Inf]
Range:          (-Inf, Inf)
Primary Domain: See note 3.
Exceptions
If the first argument is outside its domain, undefined.NaN is returned. If the true value of X^(Y) exceeds XMAX, Inf is returned. In certain other cases other NaNs are produced: See note 2.
Propagated Error
A = YX^(Y)(1 +/- loge(X)), R = Y(1 +/- loge(X)) (See note 4)
Generated error
Example Range Error:         MRE       RMSRE (See note 3)
Single Length (Standard):    1.0 ulp   0.25 ulp
Single Length (T2 special):  63.1 ulp  13.9 ulp
Double Length:               21.1 ulp  2.4 ulp
The Algorithm
Deal with special cases: either argument = 1, 0, +Inf or -Inf (see note 2). Otherwise:

(a) For the standard single precision:
(b) For double precision, and the single precision special version:
1) This subroutine implements the mathematical function x^(y) to a much greater accuracy than can be attained using the ALOG and EXP functions, by performing each step in higher precision. The single-precision version is more efficient than using DALOG and DEXP because redundant tests are omitted.

2) Results for special cases are as follows:
First Input (X)       Second Input (Y)        Result
X < 0                 ANY                     undefined.NaN
0                     Y < 0                   undefined.NaN
0                     0 < Y <= XMAX           0
0                     Inf                     unstable.NaN
0 < X < 1             Inf                     0
0 < X < 1             -Inf                    Inf
1                     -XMAX <= Y <= XMAX      1
1                     +/-Inf                  unstable.NaN
1 < X <= XMAX         Inf                     Inf
1 < X <= XMAX         -Inf                    0
Inf                   1 <= Y <= Inf           Inf
Inf                   -Inf <= Y <= -1         0
Inf                   -1 < Y < 1              undefined.NaN
otherwise             0                       1
otherwise             1                       X
3) Performing all the calculations in extended precision makes the double-precision algorithm very complex in detail, and having two arguments makes a primary domain difficult to specify. As an indication of accuracy, the functions were evaluated at 100,000 points logarithmically distributed over (0.1, 10.0), with the exponent linearly distributed over (-35.0, 35.0) (single-length), and (-300.0, 300.0) (double-length), producing the errors given above. The errors are much smaller if the exponent range is reduced.

4) The error amplification factors are calculated on the assumption that the relative error in Y is +/- that in X, otherwise there would be separate factors for both X and Y. It can be seen that the propagated error will be greatly amplified whenever loge(X) or Y is large.
SIN
DSIN
REAL32 FUNCTION SIN (VAL REAL32 X)
REAL64 FUNCTION DSIN (VAL REAL64 X)
Compute sine(X) (where X is in radians).
Domain:         [-Smax, Smax] = [-205887.4, 205887.4]S (Standard),
                              = [-4.2*10^6, 4.2*10^6]S (T2 special),
                              = [-4.29*10^9, 4.29*10^9]D
Range:          [-1.0, 1.0]
Primary Domain: [-Pi/2, Pi/2] = [-1.57, 1.57]
Exceptions
All arguments outside the domain generate an inexact.NaN, except +/-Inf, which generates an undefined.NaN.
Propagated Error
A = X cos(X), R = X cot(X)
Generated error (See note 1)
                             Primary Domain       [0, 2Pi]
                             MRE       RMSRE      MAE       RMSAE
Single Length (Standard):    0.94 ulp  0.23 ulp   0.96 ulp  0.19 ulp
Single Length (T2 special):  0.92 ulp  0.23 ulp   0.94 ulp  0.19 ulp
Double Length:               0.90 ulp  0.22 ulp   0.91 ulp  0.18 ulp
The Algorithm
Notes

1) For arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extra precision of step 2 is lost if N becomes too large, and the cut-off Smax is chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger than Smax a change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.
2) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function (outside the primary range), but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.
3) Since only the remainder of X by Pi is used in step 3, the symmetry sin(x + n*pi) = +/-sin(x) is preserved, although there is a complication due to the differing precision representations of pi.

4) The output range is not exceeded. Thus the output of SIN is always a valid argument for ASIN.
COS
DCOS
REAL32 FUNCTION COS (VAL REAL32 X)
REAL64 FUNCTION DCOS (VAL REAL64 X)
Compute cosine(X) (where X is in radians).
Domain:         [-Cmax, Cmax] = [-205887.4, 205887.4]S (Standard),
                              = [-12868.0, 12868.0]S (T2 special),
                              = [-2.1*10^8, 2.1*10^8]D
Range:          [-1.0, 1.0]
Primary Domain: See note 1.
Exceptions
All arguments outside the domain generate an inexact.NaN, except +/-Inf, which generates an undefined.NaN.
Propagated Error
A = -X sin (X), R = -X tan (X) (See note 4)
Generated error
                             [0, Pi/4)            [0, 2Pi]
                             MRE       RMSRE      MAE       RMSAE
Single Length (Standard):    0.93 ulp  0.25 ulp   0.88 ulp  0.18 ulp
Single Length (T2 special):  1.1 ulp   0.3 ulp    0.94 ulp  0.19 ulp
Double Length:               1.0 ulp   0.28 ulp   0.90 ulp  0.19 ulp
The Algorithm
Notes

1) Inspection of the algorithm shows that argument reduction always occurs, thus there is no `primary domain' for COS. So for all arguments the accuracy of the result depends crucially on step 2. The standard single-precision version performs the argument reduction in double-precision, so there is effectively no loss of accuracy at this step. For the T2 special version and the double-precision version there are effectively K extra bits in the representation of pi (K = 8 for the former and 12 for the latter). If the argument agrees with an odd integer multiple of pi/2 to more than K bits there is a loss of significant bits from the computed remainder equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.
2) The difference between COS evaluated at successive floating_point numbers is given approximately by the absolute error amplification factor, A. For arguments larger than Cmax this difference may be more than half the significant bits of the result, and so the result is considered to be essentially indeterminate and an inexact.NaN is returned. The extra precision of step 2 in the double_precision and T2 special versions is lost if N becomes too large, and the cut_off at Cmax prevents this also.
3) For small arguments the errors are not evenly distributed. As the argument becomes smaller there is an increasing bias towards negative errors (which is to be expected from the form of the Taylor series). For the single_length version and X in [-0.1, 0.1], 62% of the errors are negative, whilst for X in [-0.01, 0.01], 70% of them are.
4) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function, but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.
5) Since only the remainder of (|X|+Pi/2) by Pi is used in step 3, the symmetry cos(x + n*pi) = +/-cos(x) is preserved. Moreover, since the same rational approximation is used as in SIN, the relation cos(x) = sin(x + pi/2) is also preserved. However, in each case there is a complication due to the different precision representations of pi.
6) The output range is not exceeded. Thus the output of COS is always a valid argument for ACOS.
TAN
DTAN
REAL32 FUNCTION TAN (VAL REAL32 X)
REAL64 FUNCTION DTAN (VAL REAL64 X)
Compute tan(X) (where X is in radians).
Domain:         [-Tmax, Tmax] = [-102943.7, 102943.7]S (Standard),
                              = [-2.1*10^6, 2.1*10^6]S (T2 special),
                              = [-2.1*10^9, 2.1*10^9]D
Range:          (-Inf, Inf)
Primary Domain: [-Pi/4, Pi/4] = [-0.785, 0.785]
Exceptions
All arguments outside the domain generate an inexact.NaN, except +/-Inf, which generates an undefined.NaN. Odd integer multiples of pi/2 may produce an unstable.NaN.
Propagated Error
A = X(1+ tan^(2)(X)), R = X(1+ tan^(2) (X))/tan(X) (See note 3)
Generated error
Primary Domain Error:        MRE       RMSRE (See note 3)
Single Length (Standard):    1.44 ulp  0.39 ulp
Single Length (T2 special):  1.37 ulp  0.39 ulp
Double Length:               1.27 ulp  0.35 ulp
The Algorithm
1) R is large whenever X is near to an integer multiple of pi/2, and so tan is very sensitive to small errors near its zeros and singularities. Thus for arguments outside the primary domain the accuracy of the result depends crucially on step 2, so this is performed with very high precision, using a double-precision Pi/2 for the standard single-precision function and two double-precision floating-point numbers for the representation of pi/2 for the double-precision function. The T2 special version uses two single-precision floating-point numbers. The extra precision is lost if N becomes too large, and the cut-off Tmax is chosen to prevent this.

2) The difference between TAN evaluated at successive floating-point numbers is given approximately by the absolute error amplification factor, A. For arguments larger than Tmax this difference could be more than half the significant bits of the result, and so the result is considered to be essentially indeterminate and an inexact.NaN is returned.
3) Tan is quite badly behaved with respect to errors in the argument. Near its zeros outside the primary domain the relative error is greatly magnified, though the absolute error is only proportional to the size of the argument. In effect, the error is seriously amplified in an interval about each irrational zero, whose width increases roughly in proportion to the size of the argument. Near its singularities both absolute and relative errors become large, so any large output from this function is liable to be seriously contaminated with error, and the larger the argument, the smaller the maximum output which can be trusted. If step 3 of the algorithm requires division by zero, an unstable.NaN is produced instead.
4) Since only the remainder of X by Pi/2 is used in step 3, the symmetry tan(x + n*pi) = tan(x) is preserved, although there is a complication due to the differing precision representations of pi. Moreover, by step 3 the symmetry tan(x) = 1/tan(pi/2 - x) is also preserved.
ASIN
DASIN
REAL32 FUNCTION ASIN (VAL REAL32 X)
REAL64 FUNCTION DASIN (VAL REAL64 X)
Compute sin^(-1)(X) (in radians).
Domain: [-1.0, 1.0] Range: [-Pi/2, Pi/2] Primary Domain: [-0.5, 0.5]
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = X/sqrt(1 - X^(2)), R = X/(sin^(-1)(X) sqrt(1 - X^(2)))
Generated Error
                  Primary Domain       [-1.0, 1.0]
                  MRE       RMSRE      MAE       RMSAE
Single Length:    0.58 ulp  0.21 ulp   1.35 ulp  0.33 ulp
Double Length:    0.59 ulp  0.21 ulp   1.26 ulp  0.27 ulp
The Algorithm
1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error: however since both domain and range are bounded the absolute error in the result cannot be large.

2) By step 1, the identity sin^(-1)(x) = pi/2 - 2 sin^(-1)(sqrt((1-x)/2)) is preserved.
ACOS
DACOS
REAL32 FUNCTION ACOS (VAL REAL32 X)
REAL64 FUNCTION DACOS (VAL REAL64 X)
Compute cos^(-1)(X) (in radians).
Domain: [-1.0, 1.0] Range: [0, Pi] Primary Domain: [-0.5, 0.5]
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = -X/sqrt(1 - X^(2)), R = -X/(cos^(-1)(X) sqrt(1 - X^(2)))
Generated Error
                  Primary Domain       [-1.0, 1.0]
                  MRE       RMSRE      MAE       RMSAE
Single Length:    1.06 ulp  0.38 ulp   2.37 ulp  0.61 ulp
Double Length:    0.96 ulp  0.32 ulp   2.25 ulp  0.53 ulp
The Algorithm
1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error, although this interval is larger near 1 than near -1, since the function goes to zero with an infinite derivative there. However since both the domain and range are bounded the absolute error in the result cannot be large.

2) Since the rational approximation is the same as that in ASIN, the relation cos^(-1)(x) = pi/2 - sin^(-1)(x) is preserved.
ATAN
DATAN
REAL32 FUNCTION ATAN (VAL REAL32 X)
REAL64 FUNCTION DATAN (VAL REAL64 X)
Compute tan^(-1)(X) (in radians).
Domain:         [-Inf, Inf]
Range:          [-Pi/2, Pi/2]
Primary Domain: [-z, z], z = 2 - sqrt(3) = 0.2679
Exceptions
None.
Propagated Error
A = X/(1 + X^(2)), R = X/(tan^(-1)(X)(1 + X^(2)))
Generated Error
Primary Domain Error:   MRE       RMSRE
Single Length:          0.56 ulp  0.21 ulp
Double Length:          0.52 ulp  0.21 ulp
The Algorithm
1) For |X| > ATmax, |tan^(-1)(X)| is indistinguishable from pi/2 in the floating-point format. For single-length, ATmax = 1.68*10^(7), and for double-length ATmax = 9*10^(15), approximately.

2) This function is numerically very stable, despite the complicated argument reduction. The worst errors occur just above 2 - sqrt(3), but are no more than 3.2 ulp.
3) It is also very well behaved with respect to errors in the argument, i.e. the error amplification factors are always small.
4) The argument reduction scheme ensures that the identities tan^(-1)(X) = pi/2 - tan^(-1)(1/X), and tan^(-1)(X) = pi/6 + tan^(-1)((sqrt(3)*X - 1)/(sqrt(3) + X)) are preserved.
ATAN2
DATAN2
REAL32 FUNCTION ATAN2 (VAL REAL32 X, Y)
REAL64 FUNCTION DATAN2 (VAL REAL64 X, Y)
Compute the angular co-ordinate tan^(-1)(Y/X) (in radians) of a point whose X and Y co-ordinates are given.
Domain:         [-Inf, Inf] x [-Inf, Inf]
Range:          (-Pi, Pi]
Primary Domain: See note 2.
Exceptions
(0, 0) and (+/-Inf, +/-Inf) give undefined.NaN.
Propagated Error
A = X(1 +/- Y)/(X^(2) + Y^(2)), R = X(1 +/- Y)/(tan^(-1)(Y/X)(X^(2) + Y^(2))) (See note 3)
Generated Error
See note 2.
The Algorithm
1) This two-argument function is designed to perform rectangular-to-polar co-ordinate conversion.

2) See the notes for ATAN for the primary domain and estimates of the generated error.

3) The error amplification factors were derived on the assumption that the relative error in Y is +/- that in X, otherwise there would be separate factors for X and Y. They are small except near the origin, where the polar co-ordinate system is singular.
SINH
DSINH
REAL32 FUNCTION SINH (VAL REAL32 X)
REAL64 FUNCTION DSINH (VAL REAL64 X)
Compute sinh(X).
Domain:         [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D
Range:          (-Inf, Inf)
Primary Domain: (-1.0, 1.0)
Exceptions
X < -Hmax gives -Inf, and X > Hmax gives Inf.
Propagated Error
A = X cosh(X), R = X coth(X) (See note 3)
Generated Error
                  Primary Domain       [1.0, XBig] (See note 2)
                  MRE       RMSRE      MAE       RMSAE
Single Length:    0.91 ulp  0.26 ulp   1.41 ulp  0.34 ulp
Double Length:    0.67 ulp  0.22 ulp   1.31 ulp  0.33 ulp
The Algorithm
1) Hmax is the point at which sinh(X) becomes too large to be represented in the floating-point format.

2) XBig is the point at which e^(-|X|) becomes insignificant compared with e^(|X|) (in floating-point). For single-length it is 8.32, and for double-length it is 18.37.
3) This function is quite stable with respect to errors in the argument. Relative error is magnified near zero, but the absolute error is a better measure near the zero of the function and it is diminished there. For large arguments absolute errors are magnified, but since the function is itself large, relative error is a better criterion, and relative errors are not magnified unduly for any argument in the domain, although the output does become less reliable near the ends of the range.
COSH
DCOSH
REAL32 FUNCTION COSH (VAL REAL32 X)
REAL64 FUNCTION DCOSH (VAL REAL64 X)
Compute cosh(X).
Domain:         [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D
Range:          [1.0, Inf)
Primary Domain: [-XBig, XBig] = [-8.32, 8.32]S
                              = [-18.37, 18.37]D
Exceptions
|X| > Hmax gives Inf.
Propagated Error
A = X sinh(X) , R = X tanh(X) (See note 3)
Generated Error
Primary Domain Error:   MRE       RMS
Single Length:          1.24 ulp  0.32 ulp
Double Length:          1.24 ulp  0.33 ulp
The Algorithm
1) Hmax is the point at which cosh(X) becomes too large to be represented in the floating-point format.

2) XBig is the point at which e^(-|X|) becomes insignificant compared with e^(|X|) (in floating-point).
3) Errors in the argument are not seriously magnified by this function, although the output does become less reliable near the ends of the range.
TANH
DTANH
REAL32 FUNCTION TANH (VAL REAL32 X)
REAL64 FUNCTION DTANH (VAL REAL64 X)
Compute tanh(X).
Domain: [-Inf, Inf] Range: [-1.0, 1.0] Primary Domain: [-Log(3)/2, Log(3)/2] = [-0.549, 0.549]
Exceptions
None.
Propagated Error
A = X/cosh^(2)(X), R = X/(sinh(X) cosh(X))
Generated Error
Primary Domain Error:   MRE       RMS
Single Length:          0.53 ulp  0.2 ulp
Double Length:          0.53 ulp  0.2 ulp
The Algorithm
1) As a floating-point number, tanh(X) becomes indistinguishable from its asymptotic values of +/-1.0 for |X| > HTmax, where HTmax is 8.4 for single-length, and 19.06 for double-length. Thus the output of TANH is equal to +/-1.0 for such X.

2) This function is very stable and well-behaved, and errors in the argument are always diminished by it.
RAN
DRAN
REAL32,INT32 FUNCTION RAN (VAL INT32 X)
REAL64,INT64 FUNCTION DRAN (VAL INT64 X)
These produce a pseudo-random sequence of integers, or a corresponding sequence of floating-point numbers between zero and one. X is the seed integer that initiates the sequence.
Domain: Integers (see note 1) Range: [0.0, 1.0] x Integers
Exceptions
None.
The Algorithm
1) This function has two results, the first a real, and the second an integer (both 32 bits for single-length, and 64 bits for double-length). The integer is used as the argument for the next call to RAN, i.e. it `carries' the pseudo-random linear congruential sequence Nk, and it should be kept in scope for as long as RAN is used. It should be initialized before the first call to RAN but not modified thereafter except by the function itself.

2) If the integer parameter is initialized to the same value, the same sequence (both floating-point and integer) will be produced. If a different sequence is required for each run of a program it should be initialized to some `random' value, such as the output of a timer.
3) The integer parameter can be copied to another variable or used in expressions requiring random integers. The topmost bits are the most random. A random integer in the range [0,L] can conveniently be produced by taking the remainder by (L+1) of the integer parameter shifted right by one bit. If the shift is not done an integer in the range [-L,L] will be produced.
4) The modulus M is 2^(32) for single_length and 2^(64) for double_length, and the multipliers, a, have been chosen so that all M integers will be produced before the sequence repeats. However several different integers can produce the same floating_point value and so a floating_point output may be repeated, although the sequence of such will not be repeated until M calls have been made.
5) The floating_point result is uniformly distributed over the output range, and the sequence passes various tests of randomness, such as the `run test', the `maximum of 5 test' and the `spectral test'.
6) The double-length version is slower to execute, but `more random' than the single-length version. If a highly random sequence of single-length numbers is required, this could be produced by converting the output of DRAN to single-length. Conversely if only a relatively crude sequence of double-length numbers is required, RAN could be used for higher speed and its output converted to double-length.
To use the TB optimized library a program header must include the line

#USE "tbmaths.lib"
The version of the library described by this section has been written for 32-bit processors without hardware for floating-point arithmetic. Functions from it will give results very close, but not identical to, those produced by the corresponding functions from the single and double length libraries.
This is the version specifically intended to derive maximum performance from the IMS T400, T414, T425, and T426 processors. The single-precision functions make use of the FMUL instruction available on 32-bit processors without floating-point hardware. The library is compiled for transputer class TB.
The tables and notes at the beginning of section 1.4 apply equally here. However all the functions are contained in one library.
REAL32 FUNCTION ALOG (VAL REAL32 X)
REAL64 FUNCTION DALOG (VAL REAL64 X)
These compute: loge(X)
Domain: (0, XMAX]
Range: [MinLog, MaxLog] (See note 2)
Primary Domain: [√2/2, √2) = [0.7071, 1.4142)
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = 1, R = 1/loge(X)
Generated Error
Primary Domain Error:   MRE        RMSRE
Single Length:          1.19 ulp   0.36 ulp
Double Length:          2.4 ulp    1.0 ulp
The Algorithm
1) The term ln(2) * N is much easier to compute (and more accurate) than LnF, and it is larger provided N is not 0 (i.e. for arguments outside the primary domain). Thus the accuracy of the result improves as the modulus of loge(X) increases.
2) The minimum value that can be produced, MinLog, is the logarithm of the smallest denormalized floating-point number. For single length MinLog is -103.28, and for double length it is -744.4. The maximum value MaxLog is the logarithm of XMAX. For single length it is 88.72, and for double length it is 709.78.
3) Since Inf is used to represent all values greater than XMAX its logarithm cannot be defined.
4) This function is well_behaved and does not seriously magnify errors in the argument.
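The decomposition described in note 1 can be sketched as follows. Python is used purely for illustration (the library itself is occam); `math.log` stands in for the rational approximation on the primary domain, and the exact reduction steps used by ALOG are not given in this text:

```python
import math

def log_reduced(x):
    # Split x = f * 2**n with f in [0.5, 1), then move f into the
    # primary domain [sqrt(2)/2, sqrt(2)) by doubling and decrementing n.
    f, n = math.frexp(x)
    if f < math.sqrt(2.0) / 2.0:
        f *= 2.0
        n -= 1
    # loge(x) = n*ln(2) + ln(f); on the primary domain ln(f) would be
    # evaluated by the rational approximation (math.log stands in here).
    return n * math.log(2.0) + math.log(f)
```

As note 1 observes, the n*ln(2) term dominates whenever n is non-zero, which is why accuracy improves away from the primary domain.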
ALOG10
DALOG10
REAL32 FUNCTION ALOG10 (VAL REAL32 X)
REAL64 FUNCTION DALOG10 (VAL REAL64 X)
These compute: log10(X)
Domain: (0, XMAX]
Range: [MinL10, MaxL10] (See note 2)
Primary Domain: [√2/2, √2) = [0.7071, 1.4142)
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = log10(e), R = log10(e)/log10(X)
Generated Error
Primary Domain Error:   MRE        RMSRE
Single Length:          1.43 ulp   0.39 ulp
Double Length:          2.64 ulp   0.96 ulp
The Algorithm
1) See note 1 for ALOG.
2) The minimum value that can be produced, MinL10, is the base-10 logarithm of the smallest denormalized floating-point number. For single length MinL10 is -44.85, and for double length it is -323.3. The maximum value MaxL10 is the base-10 logarithm of XMAX. For single length MaxL10 is 38.53, and for double length it is 308.26.
3) Since Inf is used to represent all values greater than XMAX its logarithm cannot be defined.
4) This function is well_behaved and does not seriously magnify errors in the argument.
EXP
DEXP
REAL32 FUNCTION EXP (VAL REAL32 X)
REAL64 FUNCTION DEXP (VAL REAL64 X)
These compute: e^(X)
Domain: [-Inf, MaxLog) = [-Inf, 88.03)S, [-Inf, 709.78)D
Range: [0, Inf) (See note 4)
Primary Domain: [-Ln2/2, Ln2/2) = [-0.3466, 0.3466)
Exceptions
All arguments outside the domain generate an Inf.
Propagated Error
A = Xe^(X), R = X
Generated Error
Primary Domain Error:   MRE        RMSRE
Single Length:          0.51 ulp   0.21 ulp
Double Length:          0.5 ulp    0.21 ulp
The Algorithm
1) MaxLog is loge(XMAX).
2) The analytical properties of e^(X) make the relative error of the result proportional to the absolute error of the argument. Thus the accuracy of step 2, which prepares the argument for the rational approximation, is crucial to the performance of the subroutine. It is completely accurate when N = 0, i.e. in the primary domain, and becomes less accurate as the magnitude of N increases. Since N can attain larger negative values than positive ones, EXP is least accurate for large, negative arguments.
3) For sufficiently negative arguments (below -87.34 for single length and below -708.4 for double length) the output is denormalized, and so the floating-point number contains progressively fewer significant digits, which degrades the accuracy. In such cases the error can theoretically be a factor of two.
4) Although the true exponential function is never zero, for large negative arguments the true result becomes too small to be represented as a floating-point number, and EXP underflows to zero. This occurs for arguments below -103.9 for single length, and below -745.2 for double length.
5) The propagated error is considerably magnified for large positive arguments, but diminished for large negative arguments.
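The reduction into the primary domain that note 2 refers to can be sketched as follows. This is an illustrative Python sketch, not the library's occam code; `math.exp` stands in for the rational approximation, and the choice of N is the standard one for this primary domain:

```python
import math

def exp_reduced(x):
    # Choose N so that g = x - N*ln(2) falls in the primary domain
    # [-ln(2)/2, ln(2)/2); then e**x = 2**N * e**g.
    n = round(x / math.log(2.0))
    g = x - n * math.log(2.0)
    # e**g would be the rational approximation (math.exp stands in);
    # scaling by 2**N via ldexp is exact in floating point.
    return math.ldexp(math.exp(g), n)
```

The step computing g is the accuracy-critical one: any error in n*ln(2) appears directly as a relative error in the final result, which is why the library performs it in extended precision.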
POWER
DPOWER
REAL32 FUNCTION POWER (VAL REAL32 X, Y)
REAL64 FUNCTION DPOWER (VAL REAL64 X, Y)
These compute: X^(Y)
Domain: [0, Inf] x [-Inf, Inf]
Range: [0, Inf)
Primary Domain: See note 3.
Exceptions
If the first argument is outside its domain, undefined.NaN is returned. If the true value of X^(Y) exceeds XMAX, Inf is returned. In certain other cases other NaNs are produced: See note 2.
Propagated Error
A = Y·X^(Y)(1 ± loge(X)), R = Y(1 ± loge(X)) (See note 4)
Generated Error
Example Range Error (See note 3):   MRE        RMSRE
Single Length:                      1.0 ulp    0.24 ulp
Double Length:                      13.2 ulp   1.73 ulp
The Algorithm
Deal with special cases: either argument = 1, 0, +Inf or -Inf (see note 2). Otherwise:
(a) For single precision:
(b) For double precision:
1) This subroutine implements the mathematical function x^(y) to a much greater accuracy than can be attained using the ALOG and EXP functions, by performing each step in higher precision.
2) Results for special cases are as follows:
First Input (X)    Second Input (Y)      Result
< 0                ANY                   undefined.NaN
0                  < 0                   undefined.NaN
0                  0 < Y ≤ XMAX          0
0                  Inf                   unstable.NaN
0 < X < 1          Inf                   0
0 < X < 1          -Inf                  Inf
1                  -XMAX ≤ Y ≤ XMAX      1
1                  ±Inf                  unstable.NaN
1 < X ≤ XMAX       Inf                   Inf
1 < X ≤ XMAX       -Inf                  0
Inf                1 ≤ Y ≤ Inf           Inf
Inf                -Inf ≤ Y ≤ -1         0
Inf                -1 < Y < 1            undefined.NaN
otherwise          0                     1
otherwise          1                     X
3) Performing all the calculations in extended precision makes the double-precision algorithm very complex in detail, and having two arguments makes a primary domain difficult to specify. As an indication of accuracy, the functions were evaluated at 100,000 points logarithmically distributed over (0.1, 10.0), with the exponent linearly distributed over (-35.0, 35.0) (single length) and (-300.0, 300.0) (double length), producing the errors given above. The errors are much smaller if the exponent range is reduced.
4) The error amplification factors are calculated on the assumption that the relative error in Y is ± that in X, otherwise there would be separate factors for both X and Y. It can be seen that the propagated error will be greatly amplified whenever loge(X) or Y is large.
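The special-case table in note 2 can be transcribed directly as a lookup. The sketch below is Python for illustration only: strings stand in for the occam NaN varieties, Python's float infinity for Inf, and the largest double for XMAX; the ordinary case is left to the main algorithm:

```python
import math

INF = math.inf
XMAX = 1.7976931348623157e308   # largest finite double, standing in for XMAX

def power_special(x, y):
    # Row-by-row transcription of the POWER special-case table.
    if x < 0:
        return 'undefined.NaN'
    if x == 0:
        if y < 0:
            return 'undefined.NaN'
        if 0 < y <= XMAX:
            return 0.0
        if y == INF:
            return 'unstable.NaN'
    if 0 < x < 1:
        if y == INF:
            return 0.0
        if y == -INF:
            return INF
    if x == 1:
        if -XMAX <= y <= XMAX:
            return 1.0
        if y == INF or y == -INF:
            return 'unstable.NaN'
    if 1 < x <= XMAX:
        if y == INF:
            return INF
        if y == -INF:
            return 0.0
    if x == INF:
        if 1 <= y:
            return INF
        if y <= -1:
            return 0.0
        return 'undefined.NaN'    # -1 < Y < 1
    if y == 0:
        return 1.0
    if y == 1:
        return x
    return None   # ordinary case: handled by the main algorithm
```

Note that the Inf rows are tested before the `otherwise' rows, so Inf with Y = 0 yields undefined.NaN rather than 1, as the table requires.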
SIN
DSIN
REAL32 FUNCTION SIN (VAL REAL32 X)
REAL64 FUNCTION DSIN (VAL REAL64 X)
These compute: sine(X) (where X is in radians)
Domain: [-Smax, Smax] = [-12868.0, 12868.0]S, [-2.1*10^8, 2.1*10^8]D
Range: [-1.0, 1.0]
Primary Domain: [-Pi/2, Pi/2] = [-1.57, 1.57]
Exceptions
All arguments outside the domain generate an inexact.NaN, except ±Inf, which generate an undefined.NaN.
Propagated Error
A = X cos(X), R = X cot(X)
Generated Error (See note 3)
Range:           Primary Domain         [0, 2Pi]
                 MRE       RMSRE        MAE       RMSAE
Single Length:   0.65 ulp  0.22 ulp     0.74 ulp  0.18 ulp
Double Length:   0.56 ulp  0.21 ulp     0.64 ulp  0.16 ulp
The Algorithm
1) For arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extended precision corresponds to K extra bits in the representation of π (K = 8 for single length and 12 for double length). If the argument agrees with an integer multiple of π to more than K bits there is a loss of significant bits in the remainder, equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.
2) The extra precision of step 2 is lost if N becomes too large, and the cut-off Smax is chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger than Smax a change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.
3) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function (outside the primary range), but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.
4) Since only the remainder of X by Pi is used in step 3, the symmetry sin(x + nπ) = ± sin(x) is preserved, although there is a complication due to the differing precision representations of π.
5) The output range is not exceeded. Thus the output of SIN is always a valid argument for ASIN.
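The extended-precision reduction of note 1 can be illustrated by splitting π into an exactly representable high part and a low-order correction, so the large product cancels without losing the low bits. The split below is an assumption chosen for illustration; the library's actual split (and its K extra bits) is not given in this text, and `math.sin` stands in for the rational approximation:

```python
import math

# Pi split into an exactly representable leading part (201/64) and a
# low-order correction, mimicking the extra bits of step 2.
PI_HI = 3.140625
PI_LO = math.pi - PI_HI          # remaining low-order bits of pi

def sin_reduced(x):
    n = round(x / math.pi)
    # Subtract n*pi in two pieces: n*PI_HI cancels exactly against x,
    # leaving the low-order bits to be removed separately.
    r = (x - n * PI_HI) - n * PI_LO
    # sin(r) on the primary domain would be the rational approximation;
    # sin(x + n*pi) = -sin(x) for odd n restores the sign.
    return -math.sin(r) if n % 2 else math.sin(r)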
COS
DCOS
REAL32 FUNCTION COS (VAL REAL32 X)
REAL64 FUNCTION DCOS (VAL REAL64 X)
These compute: cosine (X) (where X is in radians)
Domain: [-Smax, Smax] = [-12868.0, 12868.0]S, [-2.1*10^8, 2.1*10^8]D
Range: [-1.0, 1.0]
Primary Domain: See note 1.
Exceptions
All arguments outside the domain generate an inexact.NaN, except ±Inf, which generate an undefined.NaN.
Propagated Error
A = -X sin(X), R = -X tan(X) (See note 4)
Generated Error
Range:           [0, Pi/4)              [0, 2Pi]
                 MRE       RMSRE        MAE       RMSAE
Single Length:   1.0 ulp   0.28 ulp     0.81 ulp  0.17 ulp
Double Length:   0.93 ulp  0.26 ulp     0.76 ulp  0.18 ulp
The Algorithm
1) Inspection of the algorithm shows that argument reduction always occurs, thus there is no `primary domain' for COS. So for all arguments the accuracy of the result depends crucially on step 2. The extended precision corresponds to K extra bits in the representation of π (K = 8 for single length and 12 for double length). If the argument agrees with an odd integer multiple of π/2 to more than K bits there is a loss of significant bits in the remainder, equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.
2) The extra precision of step 2 is lost if N becomes too large, and the cut-off Smax is chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger than Smax a change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.
3) For small arguments the errors are not evenly distributed. As the argument becomes smaller there is an increasing bias towards negative errors (which is to be expected from the form of the Taylor series). For the single-length version and X in [-0.1, 0.1], 62% of the errors are negative, whilst for X in [-0.01, 0.01], 70% of them are.
4) The propagated error has a complex behavior. The propagated relative error becomes large near each zero of the function, but the propagated absolute error only becomes large for large arguments. In effect, the error is seriously amplified only in an interval about each irrational zero, and the width of this interval increases roughly in proportion to the size of the argument.
5) Since only the remainder of (|X| + Pi/2) by Pi is used in step 3, the symmetry cos(x + nπ) = ± cos(x) is preserved. Moreover, since the same rational approximation is used as in SIN, the relation cos(x) = sin(x + π/2) is also preserved. However, in each case there is a complication due to the different precision representations of π.
6) The output range is not exceeded. Thus the output of COS is always a valid argument for ACOS.
TAN
DTAN
REAL32 FUNCTION TAN (VAL REAL32 X)
REAL64 FUNCTION DTAN (VAL REAL64 X)
These compute: tan(X) (where X is in radians)
Domain: [-Tmax, Tmax] = [-6434.0, 6434.0]S, [-1.05*10^8, 1.05*10^8]D
Range: (-Inf, Inf)
Primary Domain: [-Pi/4, Pi/4] = [-0.785, 0.785]
Exceptions
All arguments outside the domain generate an inexact.NaN, except ±Inf, which generate an undefined.NaN. Odd integer multiples of π/2 may produce unstable.NaN.
Propagated Error
A = X(1+ tan^(2)(X)), R = X(1+ tan^(2)(X))/tan(X) (See note 4)
Generated Error
Primary Domain Error:   MRE        RMSRE
Single Length:          3.5 ulp    0.23 ulp
Double Length:          0.69 ulp   0.23 ulp
The Algorithm
1) R is large whenever X is near to an integer multiple of π/2, and so tan is very sensitive to small errors near its zeros and singularities. Thus for arguments outside the primary domain the accuracy of the result depends crucially on step 2. The extended precision corresponds to K extra bits in the representation of π/2 (K = 8 for single length and 12 for double length). If the argument agrees with an integer multiple of π/2 to more than K bits there is a loss of significant bits in the remainder, approximately equal to the number of extra bits of agreement, and this causes a loss of accuracy in the result.
2) The extra precision of step 2 is lost if N becomes too large, and the cut-off Tmax is chosen to prevent this. In any case for large arguments the `granularity' of floating-point numbers becomes a significant factor. For arguments larger than Tmax a change in the argument of 1 ulp would change more than half of the significant bits of the result, and so the result is considered to be essentially indeterminate.
3) Step 3 of the algorithm has been slightly modified in the double-precision version from that given in Cody & Waite to avoid fixed-point underflow in the polynomial evaluation for small arguments.
4) Tan is quite badly behaved with respect to errors in the argument. Near its zeros outside the primary domain the relative error is greatly magnified, though the absolute error is only proportional to the size of the argument. In effect, the error is seriously amplified in an interval about each irrational zero, whose width increases roughly in proportion to the size of the argument. Near its singularities both absolute and relative errors become large, so any large output from this function is liable to be seriously contaminated with error, and the larger the argument, the smaller the maximum output which can be trusted. If step 4 of the algorithm requires division by zero, an unstable.NaN is produced instead.
5) Since only the remainder of X by Pi/2 is used in step 3, the symmetry tan(x + nπ) = tan(x) is preserved, although there is a complication due to the differing precision representations of π. Moreover, by step 4 the symmetry tan(x) = 1/tan(π/2 - x) is also preserved.
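The symmetries in note 5 can be spot-checked numerically. This is an illustrative Python check, with `math.tan` standing in for TAN:

```python
import math

def check_tan_symmetries(x, n):
    # tan(x + n*pi) = tan(x)  and  tan(x) = 1/tan(pi/2 - x);
    # both differences should be tiny for moderate x and n.
    periodic = abs(math.tan(x + n * math.pi) - math.tan(x))
    reciprocal = abs(math.tan(x) - 1.0 / math.tan(math.pi / 2.0 - x))
    return periodic, reciprocal
```

For large n the periodic check degrades, which is precisely the loss of accuracy in the remainder that notes 1 and 2 describe.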
ASIN
DASIN
REAL32 FUNCTION ASIN (VAL REAL32 X)
REAL64 FUNCTION DASIN (VAL REAL64 X)
These compute: sin^(-1)(X) (in radians)
Domain: [-1.0, 1.0]
Range: [-Pi/2, Pi/2]
Primary Domain: [-0.5, 0.5]
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = X/√(1 - X^(2)), R = X/(sin^(-1)(X)·√(1 - X^(2)))
Generated Error
Range:           Primary Domain         [-1.0, 1.0]
                 MRE       RMSRE        MAE       RMSAE
Single Length:   0.53 ulp  0.21 ulp     1.35 ulp  0.33 ulp
Double Length:   2.8 ulp   1.4 ulp      2.34 ulp  0.64 ulp
The Algorithm
1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error; however, since both domain and range are bounded, the absolute error in the result cannot be large.
2) By step 1, the identity sin^(-1)(x) = π/2 - 2 sin^(-1)(√((1-x)/2)) is preserved.
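The identity in note 2 can be checked numerically; it is the reduction used for arguments above the primary domain. Python's `math.asin` stands in for the rational approximation here:

```python
import math

def asin_upper(x):
    # Reduction for 0.5 < x <= 1.0:
    #   asin(x) = pi/2 - 2*asin(sqrt((1 - x)/2))
    return math.pi / 2.0 - 2.0 * math.asin(math.sqrt((1.0 - x) / 2.0))
```

The reduced argument sqrt((1-x)/2) always lies in [0, 0.5], i.e. inside the primary domain.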
ACOS
DACOS
REAL32 FUNCTION ACOS (VAL REAL32 X)
REAL64 FUNCTION DACOS (VAL REAL64 X)
These compute: cos^(-1)(X) (in radians)
Domain: [-1.0, 1.0]
Range: [0, Pi]
Primary Domain: [-0.5, 0.5]
Exceptions
All arguments outside the domain generate an undefined.NaN.
Propagated Error
A = -X/√(1 - X^(2)), R = -X/(cos^(-1)(X)·√(1 - X^(2)))
Generated Error
Range:           Primary Domain         [-1.0, 1.0]
                 MRE       RMSRE        MAE       RMSAE
Single Length:   1.1 ulp   0.38 ulp     2.4 ulp   0.61 ulp
Double Length:   1.3 ulp   0.34 ulp     2.9 ulp   0.78 ulp
The Algorithm
1) The error amplification factors are large only near the ends of the domain. Thus there is a small interval at each end of the domain in which the result is liable to be contaminated with error, although this interval is larger near 1 than near -1, since the function goes to zero with an infinite derivative there. However, since both the domain and range are bounded, the absolute error in the result cannot be large.
2) Since the rational approximation is the same as that in ASIN, the relation cos^(-1)(x) = π/2 - sin^(-1)(x) is preserved.
ATAN
DATAN
REAL32 FUNCTION ATAN (VAL REAL32 X)
REAL64 FUNCTION DATAN (VAL REAL64 X)
These compute: tan^(-1)(X) (in radians)
Domain: [-Inf, Inf]
Range: [-Pi/2, Pi/2]
Primary Domain: [-z, z], z = 2 - √3 = 0.2679
Exceptions
None.
Propagated Error
A = X/(1 + X^(2)), R = X/(tan^(-1)(X)(1 + X^(2)))
Generated Error
Primary Domain Error:   MRE        RMSRE
Single Length:          0.53 ulp   0.21 ulp
Double Length:          1.27 ulp   0.52 ulp
The Algorithm
1) For |X| > ATmax, |tan^(-1)(X)| is indistinguishable from π/2 in the floating-point format. For single length, ATmax = 1.68*10^(7), and for double length ATmax = 9*10^(15), approximately.
2) This function is numerically very stable, despite the complicated argument reduction. The worst errors occur just above 2 - √3, but are no more than 1.8 ulp.
3) It is also very well behaved with respect to errors in the argument, i.e. the error amplification factors are always small.
4) The argument reduction scheme ensures that the identities tan^(-1)(X) = π/2 - tan^(-1)(1/X) and tan^(-1)(X) = π/6 + tan^(-1)((√3·X - 1)/(√3 + X)) are preserved.
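The two identities in note 4 can be composed into the reduction scheme for arguments above 1, sketched below in Python for illustration. The order of application is an assumption, and `math.atan` stands in for the rational approximation on [-z, z]:

```python
import math

def atan_reduced(x):
    # For x > 1: atan(x) = pi/2 - atan(1/x); then, if 1/x still
    # exceeds 2 - sqrt(3), apply
    #   atan(t) = pi/6 + atan((sqrt(3)*t - 1)/(sqrt(3) + t))
    # to bring the argument into the primary domain.
    assert x > 1.0
    s3 = math.sqrt(3.0)
    t = 1.0 / x
    extra = 0.0
    if t > 2.0 - s3:
        t = (s3 * t - 1.0) / (s3 + t)
        extra = math.pi / 6.0
    # atan(t) on [-(2 - sqrt(3)), 2 - sqrt(3)] would be the rational
    # approximation (math.atan stands in here).
    return math.pi / 2.0 - (extra + math.atan(t))
```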
ATAN2
DATAN2
REAL32 FUNCTION ATAN2 (VAL REAL32 X, Y)
REAL64 FUNCTION DATAN2 (VAL REAL64 X, Y)
These compute the angular co-ordinate tan^(-1)(Y/X) (in radians) of a point whose X and Y co-ordinates are given.
Domain: [-Inf, Inf] x [-Inf, Inf]
Range: (-Pi, Pi]
Primary Domain: See note 2.
Exceptions
(0, 0) and (±Inf, ±Inf) give undefined.NaN.
Propagated Error
A = X(1 ± Y)/(X^(2) + Y^(2)), R = X(1 ± Y)/(tan^(-1)(Y/X)·(X^(2) + Y^(2))) (See note 3)
Generated Error
See note 2.
The Algorithm
1) This two-argument function is designed to perform rectangular-to-polar co-ordinate conversion.
2) See the notes for ATAN for the primary domain and estimates of the generated error.
3) The error amplification factors were derived on the assumption that the relative error in Y is ± that in X, otherwise there would be separate factors for X and Y. They are small except near the origin, where the polar co-ordinate system is singular.
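The rectangular-to-polar conversion of note 1 can be sketched as follows. Python is used for illustration; note the argument order, since ATAN2 here takes (X, Y) while Python's `math.atan2` takes (y, x):

```python
import math

def to_polar(x, y):
    # Rectangular-to-polar conversion: radius and angular co-ordinate.
    r = math.hypot(x, y)
    theta = math.atan2(y, x)   # angle in (-pi, pi]
    return r, theta
```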
SINH
DSINH
REAL32 FUNCTION SINH (VAL REAL32 X)
REAL64 FUNCTION DSINH (VAL REAL64 X)
These compute: sinh(X)
Domain: [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D
Range: (-Inf, Inf)
Primary Domain: (-1.0, 1.0)
Exceptions
X < -Hmax gives -Inf, and X > Hmax gives Inf.
Propagated Error
A = X cosh(X), R = X coth(X) (See note 3)
Generated Error
Range:           Primary Domain         [1.0, XBig] (See note 2)
                 MRE       RMSRE        MAE       RMSAE
Single Length:   0.89 ulp  0.3 ulp      0.98 ulp  0.31 ulp
Double Length:   1.3 ulp   0.51 ulp     1.0 ulp   0.3 ulp
The Algorithm
1) Hmax is the point at which sinh(X) becomes too large to be represented in the floating-point format.
2) XBig is the point at which e^(-|X|) becomes insignificant compared with e^(|X|) (in floating point). For single length it is 8.32, and for double length it is 18.37.
3) This function is quite stable with respect to errors in the argument. Relative error is magnified near zero, but the absolute error is a better measure near the zero of the function and it is diminished there. For large arguments absolute errors are magnified, but since the function is itself large, relative error is a better criterion, and relative errors are not magnified unduly for any argument in the domain, although the output does become less reliable near the ends of the range.
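The effect of XBig in note 2 can be demonstrated directly: beyond it, the e^(-|X|) term of sinh contributes nothing representable, so only e^(|X|)/2 survives. An illustrative Python check, with the double-length XBig of 18.37 assumed:

```python
import math

def sinh_large(x):
    # Beyond XBig the e**(-|x|) term of sinh(x) = (e**x - e**(-x))/2
    # is insignificant in floating point, so e**|x|/2 (with the sign
    # of x restored) is indistinguishable from sinh(x).
    assert abs(x) > 18.37      # XBig, double length
    return math.copysign(math.exp(abs(x)) / 2.0, x)
```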
COSH
DCOSH
REAL32 FUNCTION COSH (VAL REAL32 X)
REAL64 FUNCTION DCOSH (VAL REAL64 X)
These compute: cosh(X)
Domain: [-Hmax, Hmax] = [-89.4, 89.4]S, [-710.5, 710.5]D
Range: [1.0, Inf)
Primary Domain: [-XBig, XBig] = [-8.32, 8.32]S, [-18.37, 18.37]D
Exceptions
|X| > Hmax gives Inf.
Propagated Error
A = X sinh(X) , R = X tanh(X) (See note 3)
Generated Error
Primary Domain Error:   MRE        RMS
Single Length:          0.99 ulp   0.3 ulp
Double Length:          1.23 ulp   0.3 ulp
The Algorithm
1) Hmax is the point at which cosh(X) becomes too large to be represented in the floating-point format.
2) XBig is the point at which e^(-|X|) becomes insignificant compared with e^(|X|) (in floating point).
3) Errors in the argument are not seriously magnified by this function, although the output does become less reliable near the ends of the range.
TANH
DTANH
REAL32 FUNCTION TANH (VAL REAL32 X)
REAL64 FUNCTION DTANH (VAL REAL64 X)
These compute: tanh(X)
Domain: [-Inf, Inf]
Range: [-1.0, 1.0]
Primary Domain: [-Log(3)/2, Log(3)/2] = [-0.549, 0.549]
Exceptions
None.
Propagated Error
A = X/cosh^(2)(X), R = X/(sinh(X) cosh(X))
Generated Error
Primary Domain Error:   MRE        RMS
Single Length:          0.52 ulp   0.2 ulp
Double Length:          4.6 ulp    2.6 ulp
The Algorithm
1) As a floating-point number, tanh(X) becomes indistinguishable from its asymptotic values of ±1.0 for |X| > HTmax, where HTmax is 8.4 for single length and 19.06 for double length. Thus the output of TANH is equal to ±1.0 for such X.
2) This function is very stable and well behaved, and errors in the argument are always diminished by it.
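The saturation described in note 1 is easy to observe. An illustrative Python check, with `math.tanh` standing in for TANH and the double-length HTmax of 19.06 assumed:

```python
import math

def tanh_saturated(x):
    # Beyond HTmax (8.4 single length, 19.06 double length) tanh(x)
    # is indistinguishable from +/-1.0 in the floating-point format.
    return math.tanh(x) == math.copysign(1.0, x)
```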
RAN
DRAN
REAL32,INT32 FUNCTION RAN (VAL INT32 X)
REAL64,INT64 FUNCTION DRAN (VAL INT64 X)
These produce a pseudo-random sequence of integers, and a corresponding sequence of floating-point numbers between zero and one.
Domain: Integers (see note 1)
Range: [0.0, 1.0] x Integers
Exceptions
None.
The Algorithm
1) This function has two results: the first a real, and the second an integer (both 32 bits for single length, and 64 bits for double length). The integer is used as the argument for the next call to RAN, i.e. it `carries' the pseudo-random linear congruential sequence Nk, and it should be kept in scope for as long as RAN is used. It should be initialized before the first call to RAN but not modified thereafter except by the function itself.
2) If the integer parameter is initialized to the same value, the same sequence (both floating-point and integer) will be produced. If a different sequence is required for each run of a program, it should be initialized to some `random' value, such as the output of a timer.
3) The integer parameter can be copied to another variable or used in expressions requiring random integers. The topmost bits are the most random. A random integer in the range [0, L] can conveniently be produced by taking the remainder by (L+1) of the integer parameter shifted right by one bit. If the shift is not done an integer in the range [-L, L] will be produced.
4) The modulus M is 2^(32) for single-length and 2^(64) for double-length, and the multipliers, a, have been chosen so that all M integers will be produced before the sequence repeats. However several different integers can produce the same floating-point value, so a floating-point output may be repeated, although the sequence of such values will not be repeated until M calls have been made.
5) The floating-point result is uniformly distributed over the output range, and the sequence passes various tests of randomness, such as the `run test', the `maximum of 5 test' and the `spectral test'.
6) The double-length version is slower to execute, but `more random' than the single-length version. If a highly random sequence of single-length numbers is required, it can be produced by converting the output of DRAN to single length. Conversely, if only a relatively crude sequence of double-length numbers is required, RAN can be used for higher speed and its output converted to double length.
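A generator of the kind described in notes 1 to 4 can be sketched as follows. This is an illustrative Python sketch, not the library's occam code: the constants below are a standard full-period choice used purely for illustration, since the multiplier actually used by RAN is not stated in this text:

```python
# Linear congruential sequence: N(k+1) = (a*N(k) + c) mod 2**32.
M = 2 ** 32
A = 69069          # multiplier, a = 1 (mod 4)  (assumed, for illustration)
C = 1              # odd increment, so all M integers are produced

def ran(n):
    # Returns (float in [0.0, 1.0), next integer state), mirroring
    # RAN's two results; the caller feeds the state back in.
    n = (A * n + C) % M
    return n / M, n

def ran_int(n, l):
    # Note 3: shift the integer state right one bit, then take the
    # remainder by (l + 1), giving a random integer in [0, l].
    return (n >> 1) % (l + 1)
```

Seeding `ran` twice with the same value reproduces the same sequence, as note 2 describes; the right shift in `ran_int` discards the least random bottom bit and keeps the result non-negative.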