Archive for November, 2007

Remote web server - 238 ARITHMETIC 4.2.4 EMPIRICAL DATA FOR OPERATable 1

Friday, November 23rd, 2007

238 ARITHMETIC 4.2.4 EMPIRICAL DATA FOR OPERATable 1 ND ALIGNMENTS BEFORE ADDITION leu -evl b=2 b= 10 b= 16 b = 64 0 1 2 3 4 5 over 5 0.33 0.12 0.09 0.07 0.07 0.04 0.28 0.47 0.23 0.11 0.03 0.01 0.01 0.13 0.47 0.26 0.10 0.02 0.01 0.02 0.11 0.56 0.27 0.04 0.02 0.02 0.00 0.09 average 3.1 0.9 0.8 0.5 4.2.4. Distribution of Floating Point Numbers In order to analyze the average behavior of floating point arithmetic algorithms (and in particular to determine their average running time), we need some statistical information that allows us to determine how often various cases arise. The purpose of this section is to discuss the empirical and theoretical properties of the distribution of floating point numbers. A. Addition and subtraction routines. The execution time for a floating point addition or subtraction depends largely on the initial difference of exponents, and also on the number of normalization steps required (to the left or to the right). No way is known to give a good theoretical model that tells what characteristics to expect, but extensive empirical investigations have been made by D. W. Sweeney [EM Systems J. 4 (1965), 31-421. By means of a special tracing routine, Sweeney ran six typical large-scale numerical programs, selected from several different computing laboratories, and examined each floating addition or subtraction operation very carefully. Over 250,000 floating point addition-subtractions were involved in gathering this data. About one out of every ten instructions executed by the tested programs was either FADD or FSUB. Let us consider subtraction to be addition preceded by negating the second operand; therefore we may give all the statistics as if we were merely doing addition. Sweeney s results can be summarized as follows: One of the two operands to be added was found to be equal to zero about 9 percent of the time, and this was usually the accumulator (ACC). The other 91 percent of the cases split about equally between operands of the same or of opposite signs, and about equally between cases where ]u] 5 ]w] or ]v] 5 ]u]. The computed answer was zero about 1.4 percent of the time. The difference between exponents had a behavior approximately given by the probabilities shown in Table 1, for various radices b. (The over 5 line of that table includes essentially all of the cases when one operand was zero, but the average line does not include these cases.)

4.2.3 DOUBLE-PRECISION CALCULATIONS 237 For extension (Web site designers) of the

Thursday, November 22nd, 2007

4.2.3 DOUBLE-PRECISION CALCULATIONS 237 For extension of the methods of this section to triple-precision floating point fraction parts, see Y. Ikebe, CACM 8 (1965), 175-177. EXERCISES 1. [16] Try the double-precision division technique by hand, with E = &, when dividing 180000 by 314159. (Thus, let (urn, ~1~1) = (.180, .OOO) and (urn, WL) = (.314, .159), and find the quotient using the method suggested in the text following (2).) 2. [ZOO] Would it be a good idea to insert the instruction ENTX 0 between lines 30 and 31 of Program M, in order to keep unwanted information left over in register X from interfering with the accuracy of the results? 3. [A&O] Explain why overflow cannot occur during Program M. 4. [.Z?] How should Program M be changed so that extra accuracy is achieved, essentially by moving the vertical line in Fig. 4 over to the right one position? Specify all changes that are required, and determine the difference in execution time caused by these changes. b 5. [.24] How should Program A be changed so that extra accuracy is achieved, essen- tially by working with a nine-byte accumulator instead of an eight-byte accumulator to the right of the decimal point? Specify all changes that are required, and determine the difference in execution time caused by these changes. 6. [.% ] Assume that the double-precision subroutines of this section and the single- precision subroutines of Section 4.2.1 are being used in the same main program. Write a subroutine that converts a single-precision floating point number into double-precision form (l), and write another subroutine that converts a double-precision floating point number into single-precision form (reporting exponent overflow or underflow if the conversion is impossible). b 7. [M30] Estimate the accuracy of the double-precision subroutines in this section, by finding bounds 61, 62, and 6s on the relative errors 8. [MZ8] Estimate the accuracy of the improved double-precision subroutines of exercises 4 and 5, in the sense of exercise 7. 9. [M42] T. J. Dekker [Numer. Math. 18 (1971), 224-2421 has suggested an alter- native approach to double precision, based entirely on single-precision floating binary calculations. For example, Theorem 4.2.2C states that U+U = w+r, where w = v@ and r = (U 8 w) $ v, if ]u] 2 (~1 and the radix is 2; here ]r] 5 ]2~]/2~, so the pair (w, r) may be considered a double-precision version of u + w. To add two such pairs (u, u ) $ (w, w ), where ]u ] 5 ]~]/2~ and IV ] 5 ]~]/2~ and ]u] 2 /WI, Dekker suggests computing u + u = w + r (exactly), then s = (r $ w ) @ U (an approximate remainder), and finally returning the value (w $ s, (w 8 (w $ s)) $ s). Study the accuracy and efficiency of this approach when it is used recursively to produce quadruple-precision calculations.

236 ARITHMETIC 4.2.3 01 DFDIV STJ EXITDF Double-precision (Tomcat web server)

Wednesday, November 21st, 2007

236 ARITHMETIC 4.2.3 01 DFDIV STJ EXITDF Double-precision division: 02 JOV OFLO Ensure overflow is off. 03 STA TEMP 04 SLAX 2 Remove exponent. 05 STA ARG VT72 06 STX ARGX Vl 07 LDA ACC (EXPD) 178 SUB TEMP (EXPD) 39 STA EXPO !ZXPO+elL-eev. 10 ENT2 QQ+i r12 t QQ + 1. II LDA ACC 12 LDX ACCX 13 SLAX 2 Remove exponent. 14 SRAX 1 (Cf. Algorithm 4.2.1M) 15 DIV ARG If overflow, it is detected below. 16 STA ACC %?I 17 SLAX 5 Use remainder in further division. 18 DIV ARG 19 STA ACCX fw1 20 LDA ARGX (I : 4) 21 ENTX 0 22 DIV ARG(ABS) rA + [Jb4vl/wmJj/b5. 23 JOV DVZROD Did division cause overflow? 24 MUL ACC(ABS) rAX t Iw,wl/bv,(, approximately. 25 sRAx4 Multiply by b, and save 26 SLC 5 the leading byte in rX. 27 SUB ACCX CABS) Subtract Iwi I. 28 DECA I Force minus sign. 29 SUB WMI 30 JOV *+2 If no overflow, carry one more 31 INCX 1 to upper half. 32 SLC 5 (Now rA 5 0) 33 ADD ACC(ABS) rA +-lwml -IrAl. 34 STA ACC CABS) (Now rA 2 0) 35 LDA ACC wm with correct sign 36 JMP DNORM Normalize and exit. 37 DVZROD HLT 30 Unnormalized or zero divisor 38 IH EQU l(i:i) 39 WMI CON IB-1, BYTE-l (1: I) Word size minus one 1 Here is a table of the approximate average computation times for these double-precision subroutines, compared to the single-precision subroutines that appear in Section 4.2.1: Single precision Double precision Addition 45.5u 84~ Subtraction 49.524 88u Multiplication 48~ 109u Division 52u 126.5~

Free web host - 4.2.3 DOUBLE-PRECISION CALCULATIONS 235 Fig. 4. Double-precision multiplication

Wednesday, November 21st, 2007

4.2.3 DOUBLE-PRECISION CALCULATIONS 235 Fig. 4. Double-precision multiplication of eight-byte fraction parts. .29 LDA ACCX (0 : 4) OZZZZ 30 ADD TEMP (Overflow cannot occur) 31 sRAx4 32 ADD ACC (Overflow cannot occur) 33 JMP DNORM Normalize and exit. 1 Note the careful treatment of signs in this program, and note also the fact that the range of exponents makes it impossible to compute the final exponent using an index register. Program M is perhaps too slipshod in accuracy, since it throws away all the information to the right of the vertical line in Fig. 4; this can make the least significant byte as much as 2 in error. A little more accuracy can be achieved as discussed in exercise 4. Double-precision floating division is the most difficult routine, or at least the most frightening prospect we have encountered so far in this chapter. Actually, it is not terribly complicated, once we see how to do it; let us write the numbers to be divided in the form (2~~ + EUL)/(V~ + EVA), where E is the reciprocal of the word size of the computer, and where v, is assumed to be normalized. The fraction can now be expanded as follows: Since 0 2 1~11 < 1 and l/b 5 Iv,1 < 1, we have Ivr/v,I < b, and the error from dropping terms involving c2 can be disregarded. Our method therefore is to compute wm + EWE= (21, + EUL)/V,, and then to subtract E times w,v~/v, from the result. In the following program, lines 27-32 do the lower half of a double-precision addition, using another method for forcing the appropriate sign as an alternative to the trick of Program A. Program D (Double-precision division). This program adheres to the same con- ventions as Programs A and M.

Web design tools - 234 ARITHMETIC 4.2.3 When the least-significant halves are

Tuesday, November 20th, 2007

234 ARITHMETIC 4.2.3 When the least-significant halves are added together in this program, an extra digit 1 is inserted at the left of the word that is known to have the correct sign. After the addition, this byte can be 0, 1, or 2, depending on the circumstances, and all three cases are handled simultaneously in this way. (Compare this with the rather cumbersome method of complementation that is used in Program 4.2.1A.) It is worth noting that register A can be zero after the instruction on line 40 has been performed; and, because of the way MIX defines the sign of a zero result, the accumulator contains, the correct sign that is to be attached to the result if register X is nonzero. If lines 39 and 40 were interchanged, the program would be incorrect, even though both instructions are ADD ! Now let us consider double-precision multiplication. The product has four components, shown schematically in Fig. 4. Since we need only the leftmost eight bytes, it is convenient to work only with the digits to the left of the vertical line in the diagram, and this means in particular that we need not even compute the product of the two least-sigfiificant halves. Program M (Double-precision multiplication). The input and output conventions for this subroutine are the same as for Program A. 01 BYTE EQU l(4: 4) Byte size 0,~ QQ E&U BYTE*BYTE/2 Excess of double-precision exponent 03 DFMUL STJ EXITDF Double-precision multiplication: O-4 STA TEMP 05 SLAX 2 Remove exponent. 06 STA ARG V, 07 STX ARGX Vl 08 LDA TEMP (EXPD) 09 ADD ACC@XPD) IO STA EXPO EXPO + eu + eu. II ENT2 -QQ rI2 +–QQ. I.2 LDA ACC is LDX ACCX 14 SLAX 2 Remove exponent. 15 STA ACC urn 16 STX ACCX W 17 MUL ARGX %7l x Vl 18 STA TEMP 19 LDA ARG(ABS) 20 MUL ACCX(ABS) lvm x WI 21 SRA 1 oxxxx 22 ADD TEMP (I : 4) (Overflow cannot occur) 23 STA TEMP 24 LDA ARC 25 MUL ACC vm x wn 26 STA TE~~P(SIGN) Store true sign of result. 27 STA ACC Now prepare to add all the 28 STX ACCX partial products together.

Web hosting rating - 4.2.3 18 19 20 21 22 23 24

Monday, November 19th, 2007

4.2.3 18 19 20 21 22 23 24 25 26 ,27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 $ 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 2H IH DNORM DZERO 2H IH EXPOVD EXPUND 8H 9H EXITDF ARG ARGX ACC ACCX EXPO STA LDIN LD2 INCI SLAX SRAX STA STX STA LDA LDX SLAX STA SLAX ENTX STX SRC STA ADD SRAX DECA ADD ADD JAtiZ JXNZ STA JMP SLAX DEC2 CMPA JE SRAX STA LDA INCA JAN WA CMPA JL HLT HLT LDA STX JMP CON CON CON CON CON TEMP TEMP (EXPD) ACC (EXPD) 0,2 2 I,1 ARG ARGX ARGX(SIGN) ACC ACCX 2 ACC 4 I EXPO 1 IF (SIGN) ARGX (0 : 4) 4 I ACC (0 : 4) ARC IF 1F ACC 9F 1 1 =0= (1: I) 2B 2 ACC EXPCI 0,2 EXPUND ACC (EXPD) =I (3 : 3) = 8F 20 10 ACC ACCX * 0 0 0 0 0 DOUBLE-PRECISION CALCULATIONS Now ACC has the sign of the answer. rI1 +–e,. r12 +-e,. rIl+e%–ee,. Remove exponent. Scale right. OVl Vz V3 V4 V5 % vu7 V8 v9 Store true sign in both halves. Remove exponent. UlU2 u3u4u5 FXPO +- 1 (see below). 1 u5 u6 u7 ‘%3 A trick, see comments in text. Add 0 ‘Us V6 217 V8. Recover from inserted 1. (Sign varies) Add most significant halves. (Overflow cannot occur) Normalization routine: fw in rAX, e, = EXPO + r12. If fw = 0, set e, + 0. Normalize to left. Is the leading byte zero? (Rounding omitted) Compute final exponent. Is it negative? Is it more than two bytes? Bring answer into rA. Exit from subroutine floating point accumulator Part of raw exponent 1

Web site template - 232 ARITHMETIC 4.2.3 accuracy. Double precision is most

Monday, November 19th, 2007

232 ARITHMETIC 4.2.3 accuracy. Double precision is most often used for intermediate steps during the calculation of single-precision results; its full potential isn t needed. (c) It will be instructive for us to analyze these procedures in order to see how inaccurate they can be, since they typify the types of short cuts generally taken in bad single-precision routines (see exercises 7 and 8) . Let us now consider addition and subtraction operations from this stand- point. Subtraction is, of course, converted to addition by changing the sign of the second operand. Addition is performed by separately adding together the least-significant halves and the most-significant halves, propagating carries appropriately. A difficulty arises, however, since we are doing signed-magnitude arithmetic: it is possible to add the least-significant halves and to get the wrong sign (namely, when the signs of the operands are opposite and the least-significant half of the smaller operand is bigger than the least-significant half of the larger operand). The simplest solution is to anticipate the correct sign; so in step A2 (cf. Algorithm 4.2.1A), we not only assume that eU 2 e,, we also assume that IuI 2 1~1. This means we can be sure that the final sign will be the sign of U. In other respects, double-precision addition is very much like its single-precision counterpart, only everything is done twice. Program A (Double-precision addition). The subroutine DFADD adds a double- precision floating point number V, having the form (l), to a double-precision floating point number u, assuming that v is initially in rAX (i.e., registers A and X), and that u is initially stored in locations ACC and ACCX. The answer appears both in rAX and in (ACC, ACCX). The subroutine DFSUH subtracts 2, from u under the same conventions. Both input operands are assumed to be normalized, and the answer is normalized. The last portion of this program is a double-precision normalization procedure that is used by other subroutines of this section. Exercise 5 shows how to improve the program significantly. 01 ABS EQU I:5 Field definition for absolute value 0.2 SIGN EQU 0:O Field definition for sign 03 EXPD E&U I:2 Double-precision exponent field 04 DFSUH STA TEMP Double-precision subtraction: 05 LDAN TEMP Change sign of V. 06 DFADD STJ EXITDF Double-precision addition: 07 CMPA ACC (AHS) Compare JuI with 1~1. 08 JG IF 09 2F IO :iPX ACCX (AHS) 11 JLE 2F 12 IH STA AHG If 12~1< 1~1, interchange u tt V. 13 STX ARGX 14 LDA ACC 15 LDX ACCX 16 ENTI ACC (ACC and ACCX are in consecutive 17 MOVE ARG(2) locations.)

Web hosting contract - 42.3 DOUBLE-PRECISION CALCULATIONS 231 here. Special techniques apply

Sunday, November 18th, 2007

42.3 DOUBLE-PRECISION CALCULATIONS 231 here. Special techniques apply to double precision that are comparatively inap- propriate for higher precisions; and double precision is a reasonably important topic in its own right, since it is the first step beyond single precision and it, is applicable to many problems that do not require extremely high precision. Double-precision calculations are almost always required for floating point rather than fixed point arithmetic, except perhaps in statistical work where fixed point double-precision is commonly used to calculate sums of squares and cross products; since fixed point versions of double-precision arithmetic are simpler than floating point versions, we shall confine our discussion here to the latter. Double precision is quite frequently desired not only to extend the precision of the fraction parts of floating point numbers, but also to increase the range of the exponent part. Thus we shall deal in this section with the following two-word format for double-precision floating point numbers in the MIX computer: f e e f f f f f f f f. (1) Here two bytes are used for the exponent and eight bytes are used for the fraction. The exponent is excess b2/2, where b is the byte size. The sign will appear in the most significant word; it is convenient to ignore the sign of the other word completely. Our discussion of double-precision arithmetic will be quite machine-oriented, beca.use it is only by studying the problems involved in coding these routines that a person can properly appreciate the subject. A careful study of the MIX programs below is therefore essential to the understanding of the material. In this section we shall depart from the idealistic goals of accuracy stated in the previous two sections; our double-precision routines will not round their results, and a little bit of error will sometimes be allowed to creep in. Users dare not trust these routines too much. There was ample reason to squeeze out every possible drop of accuracy in the single-precision case, but now we face a different situation: (a) The extra programming required to ensure true double-precision rounding in all cases is considerable; fully accurate routines would take, say, twice as much space and half again as much more time. It was comparatively easy to make our single-precision routines perfect, but double precision brings us face to face with our machine s limitations. (A similar situation occurs with respect to other floating point subroutines; we can t expect the cosine routine to compute round(coss) exactly for all CC, since that turns out to be virtually impossible. Instead, the cosine routine should provide the best relative error it can achieve with reasonable speed, for all reasonable values of 5. Of course, the designer of the routine should try to make the computed function satisfy simple mathematical laws whenever possible (e.g., @(-X) = @z, [@XI 5 1, and @CC 2 @ y for 0 5 z 5 y < n).) (b) Single-precision arithmetic is a staple food that everybody who wants to employ floating point arithmetic must use, but double precision is usually for situations where such clean results aren t as important. The difference between seven-and eight-place accuracy can be noticeable, but we rarely care about the difference between 15- and 16-place

Php web hosting - 230 ARITHMETIC 4.2.2 24. [A&?71 Consider the set

Saturday, November 17th, 2007

230 ARITHMETIC 4.2.2 24. [A&?71 Consider the set of all intervals [UL, u7], where 2~1 and 1~~ are either nonzero floating point numbers or the special symbols +O, -0, +co, –co; each interval must have 1~1 2 uI, and ~1 = uI is allowed only when ~1 is finite and nonzero. The interval [UL, u,] stands for all floating point z such that ~1 5 z 5 Us, where we regard -co<-x<-o<+o<+x<+co for all positive x. (Thus, [l, 21 means 1 5 x 5 2; [+0, l] means 0 < x 5 1; [-0, l] means 0 2 z 5 1; [-0, +O] denotes the single value 0; and [-co, +oo] stands for everything.) Show how to define appropriate arithmetic operations on all such intervals, without resorting to overflow or underflow or other anomalous indications except when dividing by an interval that includes zero. b 25. [15] When people speak about inaccuracy in floating point arithmetic they often ascribe errors to cancellation that occurs during the subtraction of nearly equal quantities. But when u and w are approximately equal, the difference u 8 21 is obtained exactly, with no error. What do these people really mean? 26. [HA4301 (H. G. Diamond.) Suppose f(x) is a strictly increasing function on some interval (x0, xl], and let g(x) be the inverse function. (For example, f and g might be exp and ln , or tan and arctan .) If x is a floating point number such that x0 5 x 2 x1, let f(x) = round(f(x)), and if y is a floating point number such that f(xo) 5 y <_ f(xl), let Q(y) = round(g(y)); furthermore, let h(x) = g(f(x)), whenever this is defined. Although h(x) won t always be equal to x, due to rounding, we expect h(x) to be near x. Prove that if the precision bP is at least 3, and if f is strictly concave or strictly convex (i.e., f (x) < 0 or f (x) > 0 for all x in [x0, xl]), then repeated application of h will be stable in the sense that WV+))) = W(x)), for all x such that both sides of this equation are defined. In other words, there will be no drift if the subroutines are properly implemented. b 27. [A4.25] (W. M. Kahan.) Let f(z) = 1 + x + . . . f x106 = (1 -x 07)/(1 -x) for x < 1, and let g(y) = f(($ -y2)(3 + 3.45~~)) for 0 < y < 1. Evaluate g(y) on one or more pocket calculators, for y = 10V3, 10e4, 10V5, 10P6, and explain all inaccuracies in the results you obtain. (Since most present-day calculators do not round correctly, the results are often surprising. Note that g(E) = 107 -10491.35e2 + O(c4).) *4.2.3. Double-Precision Calculations Up to now we have considered single-precision floating point arithmetic, which essentially means that the floating point values we have dealt with can be stored in a single machine word. When single-precision floating point arithmetic does not yield sufficient accuracy for a given application, the precision can be increased by suitable programming techniques that use two or more words of memory to represent each number. Although we shall discuss the general question of high-precision calculations in Section 4.3, it is appropriate to give a separate discussion of double-precision

4.2.2 ACCURACY OF FLOATING POINT ARITHMETIC 229 14. (Web hosting servers)

Saturday, November 17th, 2007

4.2.2 ACCURACY OF FLOATING POINT ARITHMETIC 229 14. [MZ7] Find a suitable E such that (U @ w) @ w Z=Z u @ (v @ w) (E), when unnormalized multiplication is being used. (This generalizes (39), since unnormalized multiplication is exactly the same as normalized multiplication when the input operands U, w, and w are normalized.) b 15. [A4Z4] (H. Bj6rk.) Does the computed midpoint of an interval always lie between the endpoints? (In other words, does u 5 21 imply that u 5 (U $ V) @ 2 5 v?) 16. [A&%] (a) What is (… ((x1 $ ~2) $ 2s) @ … $ 2,) when n = lo6 and xk = 1.1111111 for all k, using eight-digit floating decimal arithmetic? (b) What happens when Eq. (14) is used to calculate the standard deviation of these particular values xk? What happens when Eqs. (15) and (16) are used instead? (c) Prove that Sk 2 0 in (16), for all choices of x1, . . . , zk. 17. [%I Write a MIX subroutine, FCMP, that compares the floating point number u in location ACC with the floating point number v in register A, and that sets the comparison indicator to LESS, EQUAL, or GREATER, according as IL < U, u -V, or u > w (6); here E is stored in location EPSILON as a nonnegative fixed point quantity with the decimal point assumed at the left of the word. Assume normalized inputs. 18. [M40] In unnormalized arithmetic is there a suitable number E such that b 19. [MSO] (W. M. Kahan.) Consider the following procedure for floating point sum- mation of x1, . . . , xn: so = co = 0; yk = xk 8 ck-1, Sk = Sk-1 $ yk, ck = (Sk @ Sk-l) 8 yk, for 1 5 k 5 n. Let the relative errors in these operations be defined by the equations yk = (xk -ck-l)(l + vk), Sk = (Sk-1 + yk)(l + ok), ck = ((Sk -Sk-l)(l +-/k) -yk)(l + Sk), where Iqkl, lokl, lykl, 16kl 5 t. Prove that Sn = c l