Archive for December, 2007

4.3.3 HOW FAST CAN WE MULTIPLY? 291 Pass

Monday, December 31st, 2007

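Pass 0. Let $A^{[0]}(t_{k-1}, \ldots, t_0) = u_t$, where $t = (t_{k-1} \ldots t_0)_2$.

Pass 1. Set $A^{[1]}(s_{k-1}, t_{k-2}, \ldots, t_0) \leftarrow A^{[0]}(0, t_{k-2}, \ldots, t_0) + \omega^{(s_{k-1} 0 \ldots 0)_2} \cdot A^{[0]}(1, t_{k-2}, \ldots, t_0)$.

Pass 2. Set $A^{[2]}(s_{k-1}, s_{k-2}, t_{k-3}, \ldots, t_0) \leftarrow A^{[1]}(s_{k-1}, 0, t_{k-3}, \ldots, t_0) + \omega^{(s_{k-2} s_{k-1} 0 \ldots 0)_2} \cdot A^{[1]}(s_{k-1}, 1, t_{k-3}, \ldots, t_0)$.

. . .

Pass k. Set $A^{[k]}(s_{k-1}, \ldots, s_1, s_0) \leftarrow A^{[k-1]}(s_{k-1}, \ldots, s_1, 0) + \omega^{(s_0 s_1 \ldots s_{k-1})_2} \cdot A^{[k-1]}(s_{k-1}, \ldots, s_1, 1)$.

It is fairly easy to prove by induction that we have

$$A^{[j]}(s_{k-1}, \ldots, s_{k-j}, t_{k-j-1}, \ldots, t_0) = \sum_{0 \le t_{k-1}, \ldots, t_{k-j} \le 1} \omega^{(s_0 s_1 \ldots s_{k-1})_2 \cdot (t_{k-1} \ldots t_{k-j} 0 \ldots 0)_2}\, u_t, \tag{33}$$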
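The passes above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `fft_passes` is hypothetical, the input length is taken to be $K = 2^k$, and $\omega = \exp(2\pi i/K)$ as in the definition of the transform. Each pass builds a fresh array rather than updating in place, and the final loop undoes the bit reversal implied by the index $(s_0 s_1 \ldots s_{k-1})_2$.

```python
import cmath

def fft_passes(u):
    """Transform u (length K = 2**k) by the k passes described in the text.

    Returns a list whose entry s is sum over t of w**(s*t) * u[t],
    with w = exp(2*pi*i/K).  (Hypothetical helper name.)
    """
    K = len(u)
    k = K.bit_length() - 1
    assert K == 1 << k, "length must be a power of two"
    w = cmath.exp(2j * cmath.pi / K)
    A = list(u)                                  # Pass 0: A[t] = u_t
    for j in range(1, k + 1):                    # Pass j replaces t_{k-j} by s_{k-j}
        pos = k - j
        B = [0] * K
        for idx in range(K):
            lo = idx & ~(1 << pos)               # same index with bit pos set to 0
            hi = idx | (1 << pos)                # same index with bit pos set to 1
            # Exponent (s_{k-j} s_{k-j+1} ... s_{k-1} 0 ... 0)_2 as a k-bit number.
            e = 0
            for i in range(j):
                s_bit = (idx >> (pos + i)) & 1   # this is s_{k-j+i}
                e |= s_bit << (k - 1 - i)
            B[idx] = A[lo] + (w ** e) * A[hi]
        A = B
    # A[(s_{k-1} ... s_0)_2] holds the transform value for s = (s_0 ... s_{k-1})_2.
    out = [0] * K
    for idx in range(K):
        s = int(format(idx, "b").zfill(k)[::-1], 2) if k else 0
        out[s] = A[idx]
    return out
```

Comparing the result with a direct evaluation of the defining sum on a short sequence is a quick way to check the index bookkeeping.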

290 ARITHMETIC 4.3.3 The reader will find it

Sunday, December 30th, 2007

The reader will find it instructive to study the ingenious method represented by (30) and (31) very carefully. Similar techniques are discussed in Section 4.6.3.

Schönhage's paper [Computing 1 (1966), 182-196] shows that these ideas can be extended to the multiplication of n-bit numbers using $r \approx 2^{\sqrt{2 \lg n}}$ moduli, obtaining a method analogous to Algorithm C. We shall not dwell on the details here, since Algorithm C is always superior; in fact, an even better method is next on our agenda.

C. Use of discrete Fourier transforms. The critical problem in high-precision multiplication is the determination of "convolution products" such as $u_r v_0 + u_{r-1} v_1 + \cdots + u_0 v_r$, and there is an intimate relation between convolutions and an important mathematical concept called "Fourier transformation." If $\omega = \exp(2\pi i/K)$ is a Kth root of unity, the one-dimensional Fourier transform of the sequence of complex numbers $(u_0, u_1, \ldots, u_{K-1})$ is defined to be the sequence $(\hat{u}_0, \hat{u}_1, \ldots, \hat{u}_{K-1})$, where

$$\hat{u}_s = \sum_{0 \le t < K} \omega^{st} u_t, \qquad 0 \le s < K.$$
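To make the phrase "convolution products" concrete, here is a minimal Python sketch that forms the sums $u_r v_0 + u_{r-1} v_1 + \cdots + u_0 v_r$ directly, without any transform. The function name `convolution_products` and the choice of 4-bit digits are assumptions made for illustration; the digit lists are least significant first.

```python
def convolution_products(u, v):
    """Return w with w[r] = u[r]*v[0] + u[r-1]*v[1] + ... + u[0]*v[r].

    u and v are digit lists, least significant digit first.
    (Hypothetical helper name; quadratic-time, for illustration only.)
    """
    w = [0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            w[i + j] += ui * vj
    return w

# 1234 and 2341 written with 4-bit digits, least significant first.
u_digits = [2, 13, 4]
v_digits = [5, 2, 9]
w_digits = convolution_products(u_digits, v_digits)
# Releasing the carries (evaluating at the radix 16) gives the integer product.
product = sum(d * 16 ** i for i, d in enumerate(w_digits))
```

The unreduced entries of `w_digits` may exceed the radix; the integer product is recovered only after the carries are propagated.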

4.3.3 HOW FAST CAN WE MULTIPLY? 289 The

Sunday, December 30th, 2007

The above observations leave us with the following problem to solve: Given positive integers $e < f$ and a nonnegative integer $u < 2^f$, compute the value of $(cu) \bmod (2^f - 1)$, where $c$ is the number such that $(2^e - 1)c \equiv 1$ (modulo $2^f - 1$); and the computation must be done in $O(f \log f)$ cycles.

The result of exercise 4.3.2-6 gives a formula for $c$ that suggests a suitable procedure. First we find the least positive integer $b$ such that

$$be \equiv 1 \pmod{f}. \tag{28}$$

This can be done using Euclid's algorithm in $O((\log f)^3)$ cycles, since Euclid's algorithm applied to $e$ and $f$ requires $O(\log f)$ iterations, and each iteration requires $O((\log f)^2)$ cycles; alternatively, we could be very sloppy here without violating the total time constraint, by simply trying $b = 1$, 2, etc., until (28) is satisfied, since such a process would take $O(f \log f)$ cycles in all. Once $b$ has been found, exercise 4.3.2-6 tells us that

$$c = c[b] = \Bigl(\sum_{0 \le j < b} 2^{ej}\Bigr) \bmod (2^f - 1). \tag{29}$$

A brute-force multiplication of $(cu) \bmod (2^f - 1)$ would not be good enough to solve the problem, since we do not know how to multiply general f-bit numbers in $O(f \log f)$ cycles. But the special form of $c$ provides a clue: The binary representation of $c$ is composed of bits in a regular pattern, and Eq. (29) shows that the number $c[2b]$ can be obtained in a simple way from $c[b]$. This suggests that we can rapidly multiply a number $u$ by $c[b]$ if we build $c[b]u$ up in $\lg b$ steps in a suitably clever manner, such as the following: Let the binary notation for $b$ be $b = (b_s \ldots b_2 b_1 b_0)_2$; we may calculate the sequences $a_k$, $d_k$, $u_k$, $v_k$ defined by the rules

$$\begin{aligned}
a_0 &= e, & a_k &= 2a_{k-1} \bmod f;\\
d_0 &= b_0 e, & d_k &= (d_{k-1} + b_k a_k) \bmod f;\\
u_0 &= u, & u_k &= (u_{k-1} + 2^{a_{k-1}} u_{k-1}) \bmod (2^f - 1);\\
v_0 &= b_0 u, & v_k &= (v_{k-1} + b_k 2^{d_{k-1}} u_k) \bmod (2^f - 1).
\end{aligned} \tag{30}$$

It is easy to prove by induction on $k$ that

$$\begin{aligned}
a_k &= (2^k e) \bmod f; & u_k &= (c[2^k]u) \bmod (2^f - 1);\\
d_k &= ((b_k \ldots b_1 b_0)_2\, e) \bmod f; & v_k &= (c[(b_k \ldots b_1 b_0)_2]u) \bmod (2^f - 1).
\end{aligned} \tag{31}$$

Hence the desired result, $(c[b]u) \bmod (2^f - 1)$, is $v_s$.

The calculation of $a_k$, $d_k$, $u_k$, $v_k$ from $a_{k-1}$, $d_{k-1}$, $u_{k-1}$, $v_{k-1}$ takes $O(\log f) + O(\log f) + O(f) + O(f) = O(f)$ cycles, and therefore the entire calculation can be done in $s \cdot O(f) = O(f \log f)$ cycles as desired.
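The recurrences (30) translate almost line for line into code. The following Python sketch is an illustration only: the function name `times_inverse` is hypothetical, $b$ is obtained with Python's built-in modular inverse (`pow(e, -1, f)`, available from Python 3.8) rather than by the procedures mentioned in the text, and multiplication by a power of two is written as a shift followed by a reduction instead of the cyclic shift a cycle-counting implementation would use.

```python
from math import gcd

def times_inverse(e, f, u):
    """Return (c*u) mod (2**f - 1), where (2**e - 1)*c = 1 (mod 2**f - 1).

    Follows the sequences a_k, d_k, u_k, v_k of (30).  Hypothetical name;
    assumes 0 < e < f with gcd(e, f) == 1 so that b in (28) exists.
    """
    assert 0 < e < f and gcd(e, f) == 1
    m = (1 << f) - 1
    b = pow(e, -1, f)                 # least positive b with b*e = 1 (mod f)
    b0 = b & 1
    a, d = e, (b0 * e) % f            # a_0, d_0
    uk, vk = u % m, (b0 * u) % m      # u_0, v_0
    k = 1
    while b >> k:                     # k runs over the remaining bits of b
        bk = (b >> k) & 1
        uk = (uk + (uk << a)) % m         # u_k uses a_{k-1}
        a = (2 * a) % f                   # a_k
        vk = (vk + bk * (uk << d)) % m    # v_k uses d_{k-1} and u_k
        d = (d + bk * a) % f              # d_k uses a_k
        k += 1
    return vk                         # v_s
```

For example, with $e = 3$ and $f = 7$ we get $b = 5$ and $c[5] = 1 + 2^3 + 2^6 + 2^9 + 2^{12} = 4681 \equiv 109$ (modulo 127), and indeed $7 \cdot 109 = 763 = 6 \cdot 127 + 1$.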

288 ARITHMETIC 4.3.3 This leaves us with operation

Saturday, December 29th, 2007

This leaves us with operation (d), which seems to be quite a difficult computation; but Schönhage has found an ingenious way to perform step (d) in $O(p_k \log p_k)$ cycles, and this is the crux of the method. As a consequence, we have

$$t_k = 6t_{k-1} + O(p_k \log p_k).$$

Since $p_k = 3^{k+2} + 17$, we can show that the time for n-bit multiplication is

$$T(n) = O(n^{\log_3 6}) = O(n^{1.63}). \tag{23}$$

(See exercise 7.)

Although the modular method is more complicated than the $O(n^{\lg 3})$ procedure discussed at the beginning of this section, Eq. (23) shows that it does, in fact, lead to an execution time substantially better than $O(n^2)$ for the multiplication of n-bit numbers. Thus we can improve on the classical method by using either of two completely different approaches.

Let us now analyze operation (d) above. Assume that we are given a set of positive integers $e_1 < e_2 < \cdots < e_r$, relatively prime in pairs; let

$$m_1 = 2^{e_1} - 1, \quad m_2 = 2^{e_2} - 1, \quad \ldots, \quad m_r = 2^{e_r} - 1. \tag{24}$$

We are also given numbers $w_1, \ldots, w_r$ such that $0 \le w_j \le m_j$. Our job is to determine the binary representation of the number $w$ that satisfies the conditions

$$0 \le w < m_1 m_2 \ldots m_r, \qquad w \equiv w_1 \pmod{m_1}, \quad \ldots, \quad w \equiv w_r \pmod{m_r}. \tag{25}$$

The method is based on (23) and (24) of Section 4.3.2. First we compute

$$w'_j = \bigl(\ldots((w_j - w'_1)\,c_{1j} - w'_2)\,c_{2j} - \cdots - w'_{j-1}\bigr)\,c_{(j-1)j} \bmod m_j, \tag{26}$$

for $j = 2, \ldots, r$, where $w'_1 = w_1 \bmod m_1$; then we compute

$$w = \bigl(\ldots(w'_r m_{r-1} + w'_{r-1})\,m_{r-2} + \cdots + w'_2\bigr)\,m_1 + w'_1. \tag{27}$$

Here $c_{ij}$ is a number such that $c_{ij} m_i \equiv 1$ (modulo $m_j$); these numbers $c_{ij}$ are not given, they must be determined from the $e_j$'s.

The calculation of (26) for all $j$ involves $\binom{r}{2}$ additions modulo $m_j$, each of which takes $O(e_r)$ cycles, plus $\binom{r}{2}$ multiplications by $c_{ij}$, modulo $m_j$. The calculation of $w$ by formula (27) involves $r$ additions and $r$ multiplications by $m_j$; it is easy to multiply by $m_j$, since this is just adding, shifting, and subtracting, so it is clear that the evaluation of Eq. (27) takes $O(r^2 e_r)$ cycles. We will soon see that each of the multiplications by $c_{ij}$, modulo $m_j$, requires only $O(e_r \log e_r)$ cycles, and therefore it is possible to complete the entire job of conversion in $O(r^2 e_r \log e_r)$ cycles.
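Formulas (26) and (27) can be sketched in Python as below. This is a hedged illustration, not the fast procedure the text is building toward: the name `mixed_radix_convert` is hypothetical, and each $c_{ij}$ is obtained with Python's built-in modular inverse (`pow(m_i, -1, m_j)`, Python 3.8+) instead of being derived from the exponents.

```python
def mixed_radix_convert(exponents, residues):
    """Recover w from its residues modulo m_j = 2**e_j - 1 via (26) and (27).

    exponents: e_1 < e_2 < ... < e_r, pairwise relatively prime.
    residues:  w_1, ..., w_r.
    (Hypothetical helper name; uses pow(x, -1, m) for the inverses c_ij.)
    """
    moduli = [(1 << e) - 1 for e in exponents]
    primed = []                                   # the numbers w'_1, ..., w'_r
    for j, mj in enumerate(moduli):
        t = residues[j] % mj
        for i in range(j):
            c_ij = pow(moduli[i], -1, mj)         # c_ij * m_i = 1 (mod m_j)
            t = ((t - primed[i]) * c_ij) % mj     # one step of (26)
        primed.append(t)
    w = 0
    for j in reversed(range(len(moduli))):        # Horner evaluation of (27)
        w = w * moduli[j] + primed[j]
    return w
```

With exponents 2, 3, 5 the moduli are 3, 7, 31, so every $w$ below $3 \cdot 7 \cdot 31 = 651$ should be recovered from its three residues.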

4.3.3 HOW FAST CAN WE MULTIPLY? 287 B.

Friday, December 28th, 2007

B. A modular method. There is another way to multiply large numbers very rapidly, based on the ideas of modular arithmetic as presented in Section 4.3.2. It is very hard to believe at first that this method can be of advantage, since a multiplication algorithm based on modular arithmetic must include the choice of moduli and the conversion of numbers into and out of modular representation, besides the actual multiplication operation itself. In spite of these formidable difficulties, A. Schönhage discovered that all of these operations can be carried out quite rapidly.

In order to understand the essential mechanism of Schönhage's method, we shall look at a special case. Consider the sequence defined by the rules

$$q_0 = 1, \qquad q_{k+1} = 3q_k - 1, \tag{20}$$

so that $q_k = 3^k - 3^{k-1} - \cdots - 1 = \tfrac{1}{2}(3^k + 1)$. We will study a procedure that multiplies $(18q_k + 8)$-bit numbers, in terms of a method for multiplying $(18q_{k-1} + 8)$-bit numbers. Thus, if we know how to multiply numbers having $(18q_0 + 8) = 26$ bits, the procedure to be described will show us how to multiply numbers of $(18q_1 + 8) = 44$ bits, then 98 bits, then 260 bits, etc., eventually increasing the number of bits by almost a factor of 3 at each step.

Let $p_k = 18q_k + 8$. When multiplying $p_k$-bit numbers, the idea is to use the six moduli

$$\begin{aligned}
m_1 &= 2^{6q_k - 1} - 1, & m_2 &= 2^{6q_k + 1} - 1, & m_3 &= 2^{6q_k + 2} - 1,\\
m_4 &= 2^{6q_k + 3} - 1, & m_5 &= 2^{6q_k + 5} - 1, & m_6 &= 2^{6q_k + 7} - 1.
\end{aligned} \tag{21}$$

These moduli are relatively prime, by Eq. 4.3.2-18, since the exponents

$$6q_k - 1, \quad 6q_k + 1, \quad 6q_k + 2, \quad 6q_k + 3, \quad 6q_k + 5, \quad 6q_k + 7 \tag{22}$$

are always relatively prime (see exercise 6). The six moduli in (21) are capable of representing numbers up to $m = m_1 m_2 m_3 m_4 m_5 m_6 > 2^{36q_k + 16} = 2^{2p_k}$, so there is no chance of overflow in the multiplication of $p_k$-bit numbers $u$ and $v$. Thus we may use the following method, when $k > 0$:

a) Compute $u_1 = u \bmod m_1$, ..., $u_6 = u \bmod m_6$; and $v_1 = v \bmod m_1$, ..., $v_6 = v \bmod m_6$.

b) Multiply $u_1$ by $v_1$, $u_2$ by $v_2$, ..., $u_6$ by $v_6$. These are numbers of at most $6q_k + 7 = 18q_{k-1} + 1 < p_{k-1}$ bits, so the multiplications can be performed by using the assumed $p_{k-1}$-bit multiplication procedure.

c) Compute $w_1 = u_1 v_1 \bmod m_1$, $w_2 = u_2 v_2 \bmod m_2$, ..., $w_6 = u_6 v_6 \bmod m_6$.

d) Compute $w$ such that $0 \le w$
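The numerical claims about the sequence (20) and the exponents (22) are easy to check by machine. The Python sketch below is only a sanity check for the first few values of $k$, not a proof; the helper names `q_sequence` and `exponents_for` are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from itertools import combinations
from math import gcd, prod

def q_sequence(count):
    """First `count` terms of q_0 = 1, q_{k+1} = 3*q_k - 1 (hypothetical name)."""
    q = [1]
    while len(q) < count:
        q.append(3 * q[-1] - 1)
    return q

def exponents_for(qk):
    """The six exponents of (22) for a given q_k (hypothetical name)."""
    return [6 * qk + d for d in (-1, 1, 2, 3, 5, 7)]

qs = q_sequence(6)
ps = [18 * q + 8 for q in qs]                  # p_k = 18*q_k + 8
checks = []
for k, (qk, pk) in enumerate(zip(qs, ps)):
    es = exponents_for(qk)
    coprime = all(gcd(a, b) == 1 for a, b in combinations(es, 2))
    m = prod((1 << e) - 1 for e in es)         # m_1 * ... * m_6
    no_overflow = m > 1 << (2 * pk)            # room for a 2*p_k-bit product
    fits = k == 0 or max(es) < ps[k - 1]       # residues fit the smaller routine
    checks.append((coprime, no_overflow, fits))
```

Each entry of `checks` records, for one value of $k$, whether the exponents are pairwise relatively prime, whether the product of the moduli exceeds $2^{2p_k}$, and whether the largest residue fits in fewer than $p_{k-1}$ bits.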

286 ARITHMETIC 4.3.3 Thus there is a constant

Thursday, December 27th, 2007

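Thus there is a constant $c$ such that

$$t_k \le c\,q_k r_k^2 \lg r_k + (2r_k + 1)\,t_{k-1}.$$

To complete the estimation of $t_k$ we can prove by brute force that

$$t_k \le C q_{k+1} 2^{2.5\sqrt{\lg q_{k+1}}} \tag{18}$$

for some constant $C$. Let us choose $C > 20c$, and let us also take $C$ large enough so that (18) is valid for $k \le k_0$, where $k_0$ will be specified below. Then when $k > k_0$, let $Q_k = \lg q_k$, $R_k = \lg r_k$; we have by induction

$$t_k \le c\,q_k r_k^2 \lg r_k + (2r_k + 1)\,C q_k 2^{2.5\sqrt{Q_k}} = C q_{k+1} 2^{2.5\sqrt{\lg q_{k+1}}}\,(\eta_1 + \eta_2),$$

where

$$\eta_1 = \frac{c}{C}\,R_k 2^{R_k - 2.5\sqrt{Q_{k+1}}} < \frac{1}{20}\,R_k 2^{-R_k} < 0.05, \qquad \eta_2 = \Bigl(2 + \frac{1}{r_k}\Bigr) 2^{2.5(\sqrt{Q_k} - \sqrt{Q_{k+1}})} \to 2^{-1/4} < 0.85,$$

since

$$\sqrt{Q_{k+1}} - \sqrt{Q_k} = \sqrt{Q_k + \lfloor\sqrt{Q_k}\rfloor} - \sqrt{Q_k} \to \tfrac{1}{2}$$

as $k \to \infty$. It follows that we can find $k_0$ such that $\eta_2 < 0.95$ for all $k > k_0$, and this completes the proof of (18) by induction.

Finally, therefore, we may compute $T(n)$. Since $n > q_{k-1} + q_{k-2}$, we have $q_{k-1} < n$; hence

$$r_{k-1} = 2^{\lfloor\sqrt{\lg q_{k-1}}\rfloor} < 2^{\sqrt{\lg n}}, \qquad\text{and}\qquad q_k = r_{k-1} q_{k-1} < n\,2^{\sqrt{\lg n}}.$$

Thus

$$t_{k-1} \le C q_k 2^{2.5\sqrt{\lg q_k}} < C n\,2^{\sqrt{\lg n} + 2.5(\sqrt{\lg n} + 1)},$$

and, since $T(n) = O(q_k) + t_{k-1}$, we have finally derived the following theorem:

Theorem C. There is a constant $c_0$ such that the execution time of Algorithm C is less than $c_0 n\,2^{3.5\sqrt{\lg n}}$ cycles.

Since $n\,2^{3.5\sqrt{\lg n}} = n^{1 + 3.5/\sqrt{\lg n}}$, this result is noticeably stronger than Theorem A. By adding a few complications to the algorithm, pushing the ideas to their apparent limits (see exercise 5), we can improve the estimated execution time to

$$T(n) = O(n\,2^{\sqrt{2 \lg n}} \log n). \tag{19}$$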
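The limit used for the second term of the induction can be spelled out in one line. The following worked step assumes, as in step C1 of the algorithm, that $Q_k = \lg q_k$, $R_k = \lg r_k = \lfloor\sqrt{Q_k}\rfloor$, and $Q_{k+1} = Q_k + R_k$; it is a sketch of the reasoning rather than a quotation.

$$\sqrt{Q_{k+1}} - \sqrt{Q_k} = \frac{Q_{k+1} - Q_k}{\sqrt{Q_{k+1}} + \sqrt{Q_k}} = \frac{\lfloor\sqrt{Q_k}\rfloor}{\sqrt{Q_k + \lfloor\sqrt{Q_k}\rfloor} + \sqrt{Q_k}} \to \frac{1}{2} \quad (k \to \infty),$$

so that, because $r_k \to \infty$,

$$\Bigl(2 + \frac{1}{r_k}\Bigr) 2^{2.5(\sqrt{Q_k} - \sqrt{Q_{k+1}})} \to 2 \cdot 2^{-5/4} = 2^{-1/4} \approx 0.841.$$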

4.3.3 HOW FAST CAN WE MULTIPLY? 285 C7.

Thursday, December 27th, 2007

C7. [Find θ's.] Set $r \leftarrow r_k$, $q \leftarrow q_k$, $p \leftarrow q_{k-1} + q_k$. (At this point stack W contains a sequence of numbers ending with W(0), W(1), ..., W(2r) from bottom to top, where each W(j) is a 2p-bit number.) Now for j = 1, 2, 3, ..., 2r, perform the following loop: For t = 2r, 2r − 1, 2r − 2, ..., j, set $W(t) \leftarrow (W(t) - W(t-1))/j$. (Here j must increase and t must decrease. The quantity $(W(t) - W(t-1))/j$ will always be a nonnegative integer that fits in 2p bits; cf. (15).)

C8. [Find W's.] For j = 2r − 1, 2r − 2, ..., 1, perform the following loop: For t = j, j + 1, ..., 2r − 1, set $W(t) \leftarrow W(t) - jW(t+1)$. (Here j must decrease and t must increase. The result of this operation will again be a nonnegative 2p-bit integer; cf. (17).)

C9. [Set answer.] Set w to the $2(q_k + q_{k+1})$-bit integer $(\ldots(W(2r)2^q + W(2r-1))2^q + \cdots + W(1))2^q + W(0)$. Remove W(2r), ..., W(0) from stack W.

C10. [Return.] Set $k \leftarrow k + 1$. Remove the top of stack C. If it is code-3, go to step C6. If it is code-2, put w onto stack W and go to step C7. And if it is code-1, terminate the algorithm (w is the answer).

Let us now estimate the running time, T(n), for Algorithm C, in terms of some things we shall call "cycles," i.e., elementary machine operations. Step C1 takes $O(q_k)$ cycles, even if we represent the number $q_k$ internally as a long string of $q_k$ bits followed by some delimiter, since $q_k + q_{k-1} + \cdots + q_0$ will be $O(q_k)$. Step C2 obviously takes $O(q_k)$ cycles.

Now let $t_k$ denote the amount of computation required to get from step C3 to step C10 for a particular value of k (after k has been decreased at the beginning of step C3). Step C3 requires $O(q)$ cycles at most. Step C4 involves r multiplications of p-bit numbers by (lg 2r)-bit numbers, and r additions of p-bit numbers, all repeated 4r + 2 times. Thus we need a total of $O(r^2 q \log r)$ cycles. Step C5 requires moving 4r + 2 p-bit numbers, so it involves $O(rq)$ cycles. Step C6 requires $O(q)$ cycles, and it is done 2r + 1 times per iteration. The recursion involved when the algorithm essentially invokes itself (by returning to step C3) requires $t_{k-1}$ cycles, 2r + 1 times. Step C7 requires $O(r^2)$ subtractions of p-bit numbers and divisions of 2p-bit by (lg 2r)-bit numbers, so it requires $O(r^2 q \log r)$ cycles. Similarly, step C8 requires $O(r^2 q \log r)$ cycles. Step C9 involves $O(rq)$ cycles, and C10 takes hardly any time at all.

Summing up, we have $T(n) = O(q_k) + O(q_k) + t_{k-1}$, where (if $q = q_k$ and $r = r_k$) the main contribution to the running time satisfies

$$\begin{aligned}
t_k &= O(q) + O(r^2 q \log r) + O(rq) + (2r+1)O(q) + O(r^2 q \log r) + O(r^2 q \log r) + O(rq) + O(q) + (2r+1)t_{k-1}\\
&= O(r^2 q \log r) + (2r+1)t_{k-1}.
\end{aligned}$$
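Steps C7 through C9 can be tried out on small numbers. The Python sketch below works on an ordinary list instead of stack W, and the function name `interpolate_in_place` is hypothetical; the loop bounds follow the step descriptions, with `W[t]` standing for W(t).

```python
def interpolate_in_place(values, r, q):
    """Apply steps C7, C8, C9 to the list W(0), ..., W(2r).

    Returns (coefficients, w): the coefficients after step C8, lowest degree
    first, and the integer assembled in step C9.  (Hypothetical helper name.)
    """
    W = list(values)
    assert len(W) == 2 * r + 1
    # C7: divided differences; j increases, t decreases.
    for j in range(1, 2 * r + 1):
        for t in range(2 * r, j - 1, -1):
            diff = W[t] - W[t - 1]
            assert diff >= 0 and diff % j == 0    # the text promises exact division
            W[t] = diff // j
    # C8: convert to ordinary coefficients; j decreases, t increases.
    for j in range(2 * r - 1, 0, -1):
        for t in range(j, 2 * r):
            W[t] -= j * W[t + 1]
    # C9: w = (...(W(2r)*2^q + W(2r-1))*2^q + ... + W(1))*2^q + W(0).
    w = 0
    for t in range(2 * r, -1, -1):
        w = (w << q) + W[t]
    return W, w

# Values W(0..4) for 1234 * 2341 split into 4-bit pieces (r = 2, q = 4).
coeffs, w = interpolate_in_place([10, 304, 1980, 7084, 18526], r=2, q=4)
```

After step C7 the list holds 10, 294, 691, 341, 36, and after step C8 it holds the coefficients 10, 69, 64, 125, 36 of $W(x) = 36x^4 + 125x^3 + 64x^2 + 69x + 10$.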

284 ARITHMETIC 4.3.3 C4. Break into Fig. 8.

Wednesday, December 26th, 2007

Fig. 8. Toom-Cook algorithm for high-precision multiplication.

C4. [Break into r + 1 parts.] Let the number at the top of stack C be regarded as a list of r + 1 numbers with q bits each, $(U_r \ldots U_1 U_0)_{2^q}$. (The top of stack C now contains an $(r+1)q = (q_k + q_{k+1})$-bit number.) For j = 0, 1, ..., 2r, compute the p-bit numbers

$$(\ldots(U_r j + U_{r-1})j + \cdots + U_1)j + U_0 = U(j)$$

and successively put these values onto stack U. (The bottom of stack U now contains U(0), then comes U(1), etc., with U(2r) on top. Note that

$$U(j) \le U(2r) < 2^q\bigl((2r)^r + (2r)^{r-1} + \cdots + 1\bigr) < 2^{q+1}(2r)^r \le 2^p,$$

by exercise 3.) Then remove $U_r \ldots U_1 U_0$ from stack C. Now the top of stack C contains another list of r + 1 q-bit numbers, $V_r \ldots V_1 V_0$, and the p-bit numbers

$$(\ldots(V_r j + V_{r-1})j + \cdots + V_1)j + V_0 = V(j)$$

should be put onto stack V in the same way. After this has been done, remove $V_r \ldots V_1 V_0$ from stack C.

C5. [Recurse.] Successively put the following items onto stack C, at the same time emptying stacks U and V: code-2, V(2r), U(2r), code-3, V(2r − 1), U(2r − 1), ..., code-3, V(1), U(1), code-3, V(0), U(0). Go back to step C3.

C6. [Save one product.] (At this point the multiplication algorithm has set w to one of the products W(j) = U(j)V(j).) Put w onto stack W. (This number w contains $2(q_k + q_{k-1})$ bits.) Go back to step C3.
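Step C4 amounts to splitting a number into q-bit pieces and evaluating the resulting polynomial at j = 0, 1, ..., 2r by Horner's rule. A minimal Python sketch follows; the names `split_parts` and `evaluate_points` are hypothetical, and plain lists stand in for stacks U and V.

```python
def split_parts(u, r, q):
    """Split u into r + 1 pieces of q bits each, returned as [U_0, U_1, ..., U_r]."""
    mask = (1 << q) - 1
    return [(u >> (q * i)) & mask for i in range(r + 1)]

def evaluate_points(parts, r):
    """Return [U(0), U(1), ..., U(2r)] using (...(U_r*j + U_{r-1})*j + ...)*j + U_0."""
    values = []
    for j in range(2 * r + 1):
        acc = 0
        for coeff in reversed(parts):     # start from U_r and work down to U_0
            acc = acc * j + coeff
        values.append(acc)
    return values

# Small example with r = 2 and q = 4 (12-bit operands).
U = evaluate_points(split_parts(1234, 2, 4), 2)
V = evaluate_points(split_parts(2341, 2, 4), 2)
W = [a * b for a, b in zip(U, V)]         # the products W(j) = U(j) * V(j)
```

In the full algorithm each product W(j) would be obtained by a recursive call (steps C5 and C6) rather than by a built-in multiplication.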

4.3.3 HOW FAST CAN WE MULTIPLY? 283 Algorithm

Tuesday, December 25th, 2007

Algorithm C (High-precision multiplication of binary numbers). Given a positive integer n and two nonnegative n-bit integers u and v, this algorithm forms their 2n-bit product, w. Four auxiliary stacks are used to hold the long numbers that are manipulated during the procedure:

Stacks U, V: Temporary storage of U(j) and V(j) in step C4.
Stack C: Numbers to be multiplied, and control codes.
Stack W: Storage of W(j).

These stacks may contain either binary numbers or special control symbols called code-1, code-2, and code-3. The algorithm also constructs an auxiliary table of numbers $q_k$, $r_k$; this table is maintained in such a manner that it may be stored as a linear list, where a single pointer that traverses the list (moving back and forth) may be used to access the current table entry of interest.

(Stacks C and W are used to control the recursive mechanism of this multiplication algorithm in a reasonably straightforward manner that is a special case of general procedures discussed in Chapter 8.)

C1. [Compute q, r tables.] Set stacks U, V, C, and W empty. Set $k \leftarrow 1$, $q_0 \leftarrow q_1 \leftarrow 16$, $r_0 \leftarrow r_1 \leftarrow 4$, $Q \leftarrow 4$, $R \leftarrow 2$. Now if $q_{k-1} + q_k < n$, set $k \leftarrow k + 1$, $Q \leftarrow Q + R$, $R \leftarrow \lfloor\sqrt{Q}\rfloor$, $q_k \leftarrow 2^Q$, $r_k \leftarrow 2^R$, and repeat this operation until $q_{k-1} + q_k \ge n$. (Note: The calculation of $R \leftarrow \lfloor\sqrt{Q}\rfloor$ does not require a square root to be taken, since we may simply set $R \leftarrow R + 1$ if $(R+1)^2 \le Q$ and leave R unchanged if $(R+1)^2 > Q$; see exercise 2. In this step we build the sequences

k = 0, 1, 2, 3, 4, 5, 6, ...
$q_k$ = $2^4$, $2^4$, $2^6$, $2^8$, $2^{10}$, $2^{13}$, $2^{16}$, ...
$r_k$ = $2^2$, $2^2$, $2^2$, $2^2$, $2^3$, $2^3$, $2^4$, ...

The multiplication of 70000-bit numbers would cause this step to terminate with k = 6, since $70000 < 2^{13} + 2^{16}$.)

C2. [Put u, v on stack.] Put code-1 on stack C, then place u and v onto stack C as numbers of exactly $q_{k-1} + q_k$ bits each.

C3. [Check recursion level.] Decrease k by 1. If k = 0, the top of stack C now contains two 32-bit numbers, u and v; remove them, set $w \leftarrow uv$ using a built-in routine for multiplying 32-bit numbers, and go to step C10. If k > 0, set $r \leftarrow r_k$, $q \leftarrow q_k$, $p \leftarrow q_{k-1} + q_k$, and go on to step C4.
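The table construction in step C1 can be sketched in a few lines of Python. The function name `build_tables` is hypothetical, and ordinary lists stand in for the linear list with a moving pointer; the square root is avoided exactly as the note in step C1 suggests.

```python
def build_tables(n):
    """Build the q_k and r_k tables of step C1 for n-bit operands.

    Returns (k, q, r) where q[i] = q_i and r[i] = r_i.  (Hypothetical name.)
    """
    q = [16, 16]            # q_0 = q_1 = 16
    r = [4, 4]              # r_0 = r_1 = 4
    k, Q, R = 1, 4, 2
    while q[k - 1] + q[k] < n:
        k += 1
        Q += R
        if (R + 1) ** 2 <= Q:       # R <- floor(sqrt(Q)) without a square root
            R += 1
        q.append(1 << Q)            # q_k = 2**Q
        r.append(1 << R)            # r_k = 2**R
    return k, q, r

k, q, r = build_tables(70000)
```

For n = 70000 the loop stops at k = 6, because $q_5 + q_6 = 2^{13} + 2^{16} = 73728 \ge 70000$ while $q_4 + q_5 = 9216$ is still too small.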

282 ARITHMETIC 4.3.3 The leftmost column of this

Tuesday, December 25th, 2007

The leftmost column of this tableau is a listing of the given values of W(0), W(1), ..., W(4); the kth succeeding column is obtained by computing the difference between successive values of the preceding column and dividing by k. The coefficients $\theta_j$ appear at the top of the columns, so that $\theta_0 = 10$, $\theta_1 = 294$, ..., $\theta_4 = 36$, and we have

$$W(x) = 36x^{\underline{4}} + 341x^{\underline{3}} + 691x^{\underline{2}} + 294x^{\underline{1}} + 10 = (((36(x-3) + 341)(x-2) + 691)(x-1) + 294)x + 10. \tag{16}$$

In general, we can write

$$W(x) = (\ldots((\theta_{m-1}(x-m+2) + \theta_{m-2})(x-m+3) + \theta_{m-3})(x-m+4) + \cdots + \theta_1)x + \theta_0,$$

and this formula shows how the coefficients $W_{m-1}, \ldots, W_1, W_0$ can be obtained from the θ's:

$$\begin{array}{rrrrr}
36 & 341 \\
 & -3\cdot 36 \\ \hline
36 & 233 & 691 \\
 & -2\cdot 36 & -2\cdot 233 \\ \hline
36 & 161 & 225 & 294 \\
 & -1\cdot 36 & -1\cdot 161 & -1\cdot 225 \\ \hline
36 & 125 & 64 & 69 & 10
\end{array} \tag{17}$$

Here the numbers below the horizontal lines successively show the coefficients of the polynomials $\theta_{m-1}$, $\theta_{m-1}(x-m+2) + \theta_{m-2}$, $(\theta_{m-1}(x-m+2) + \theta_{m-2})(x-m+3) + \theta_{m-3}$, etc. From this tableau we have

$$W(x) = 36x^4 + 125x^3 + 64x^2 + 69x + 10,$$

so the answer to our original problem is 1234 · 2341 = W(16), where W(16) is obtained by adding and shifting. A generalization of this method for obtaining coefficients is discussed in Section 4.6.4.

The basic Stirling number identity,

$$x^n = \left\{ {n \atop n} \right\} x^{\underline{n}} + \cdots + \left\{ {n \atop 1} \right\} x^{\underline{1}} + \left\{ {n \atop 0} \right\},$$

Eq. 1.2.6-41, shows that if the coefficients of W(x) are nonnegative, so are the numbers $\theta_j$, and in such a case all of the intermediate results in the above computation are nonnegative. This further simplifies the Toom-Cook multiplication algorithm, which we will now consider in detail.
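The nested form in (16) can be evaluated directly, which gives a quick check that the θ's quoted in the text reproduce both the sample values W(0), ..., W(4) and the final product. The Python sketch below uses a hypothetical helper name, `newton_eval`, and takes the θ values from the text rather than recomputing them.

```python
def newton_eval(theta, x):
    """Evaluate (...((theta[m-1]*(x-m+2) + theta[m-2])*(x-m+3) + ...)*x + theta[0].

    theta is the list [theta_0, theta_1, ..., theta_{m-1}].  (Hypothetical name.)
    """
    acc = theta[-1]
    for i in range(len(theta) - 2, -1, -1):
        acc = acc * (x - i) + theta[i]    # multiply by (x - i), then add theta_i
    return acc

theta = [10, 294, 691, 341, 36]           # theta_0 ... theta_4 from the tableau
samples = [newton_eval(theta, j) for j in range(5)]
answer = newton_eval(theta, 16)           # W(16), the product 1234 * 2341
```

Evaluating at x = 16 gives $(((36 \cdot 13 + 341) \cdot 14 + 691) \cdot 15 + 294) \cdot 16 + 10 = 2888794$.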