Archive for July, 2007

76 RANDOM NUMBERS (Web hosting mysql) 3.3.3 Let us begin with

Thursday, July 26th, 2007

76 RANDOM NUMBERS 3.3.3 Let us begin with a proof of a simple a priori law, for the least complicated case of the permutation test. The gist of our first theorem is that we have X,+1 < X, about half the time, provided that the sequence has high potency. Theorem P. Let a, c, and m generate a linear congruential sequence with maximum period; let b = a - 1 and let d be the greatest common divisor of m and b. The probability that X,+1 < X, is equal to 3 + r, where r = (2(c mod d) -d)/2m; (1) hence Irl < d/2m. Proof. The proof of this theorem involves some techniques that are of interest in themselves. First we define s(x) = (aa: + c) mod m. (2) Thus, &+I = s(Xn), and the theorem reduces to counting the number of integers 2 such that 0 < z < m and s(z) < z (since each such integer occurs somewhere in the period). We want to show that this number is $(m + 2(cmodd) -d). (3) The function [(z -s(z))/ml is equal to 1 when z > s(z), and it is 0 otherwise; hence the count we wish to obtain can be written simply as (4) (Recall that r—y] = -[yJ and b = a -1.) Such sums can be evaluated by the method of exercise 1.2.4-37, where we have proved that g = w@, k), (5) whenever h and k are integers and k > 0. Since a is relatively prime to m, this formula yields o<~m[EgJ = @-l)p–) +(+ - o<~~~~~J=(b-l)~-l)+~+C-(emodd), - and (3) follows immediately. 1

Web server info - 3.3.3 THEORETICAL TESTS 75 (b) Let C =

Thursday, July 26th, 2007

3.3.3 THEORETICAL TESTS 75 (b) Let C = N/D, where N and D denote the numerator and denominator of the expression in part (a). Show that N2 < D2, hence -1 5 C 5 1; and obtain a formula for the difference 0 -N2. [Hint: See exercise 1.2.3-30.1 (C) If C = &I, show that oxk + PYk = 7, 0 < k < n, for some constants cy, p, and T, not all zero. 18. [A&%] (a) Show that if n = 2, the serial correlation coefficient (23) is always equal to -1 (unless the denominator is zero). (b) S imilarly, show that when n = 3, the serial correlation coefficient always equals -4. (c) Show that the denominator in (23) is zero if and only if Uo = UI =. = U,-I. 19. [M40] What are the mean and standard deviation of the serial correlation coeffi- cient (23) when n = 4 and the U s are independent and uniformly distributed between zero and one? 20. [M47] Find the distribution of the serial correlation coefficient (23), for general n, assuming that the U, are independent random variables uniformly distributed between zero and one. 21. [I91 What value of f is computed by Algorithm P if it is presented with the permutation (1,2,9,8,5,3,6,7,0,4)? 22. [18] For what permutation of (0, 1,2,3,4,5,6,7,8,9} will Algorithm P produce the value f = 1024? *3.3.3. Theoretical Tests Although it is always possible to test a random number generator using the methods in the previous section, it is far better to have a priori tests, i.e., theoretical results that tell us in advance how well those tests will come out. Such theoretical results give us much more understanding about the generation methods than empirical, trial-and-error results do. In this section we shall study the linear congruential sequences in more detail; if we know what the results of certain tests will be before we actually generate the numbers, we have a better chance of choosing a, m, and c properly. The development of this kind of theory is quite difficult, although some progress has been made. The results obtained so far are generally for statistical tests made over the entire period. Not all statistical tests make sense when they are applied over a full period-for example, the equidistribution test will give results that are too perfect-but the serial test, gap test, permutation test, maximum test, etc. can be fruitfully analyzed in this way. Such studies will detect global nonrandomness of a sequence, i.e., improper behavior in very large samples. The theory we shall discuss is quite illuminating, but it does not eliminate the need for testing local nonrandomness by the methods of Section 3.3.2. Indeed, it appears to be extremely hard to prove anything useful about short subsequences. Only a few theoretical results are known about the behavior of linear congruential sequences over less than a full period; these will be discussed at the end of Section 3.3.4. (See also exercise 18.)

74 RANDOM NUMBERS 3.3.2 7. [08] Apply the (Msn web hosting)

Wednesday, July 25th, 2007

74 RANDOM NUMBERS 3.3.2 7. [08] Apply the coupon collector s test procedure (Algorithm C) with d = 3 and 72 = 7, to the following sequence: 1101221022120202001212201010201121. What lengths do the seven subsequences have? b 8. [MZ?] How many U s need to be examined, on the average, in the coupon collec- tor s test (Algorithm C) before 72 complete sets have been found, assuming that the sequence is random? What is the standard deviation? [Hint: See Eq. 1.2.9-28.1 9. [A4,%?1] Generalize the coupon collector s test so that the search stops as soon as w distinct values have been found, where w is a fixed positive integer less than or equal to d. What probabilities should be used in place of (6)? 10. [A&% ] Solve exercise 8 for the more general coupon collector s test described in exercise 9. 11. [OO] The runs up in a particular permutation are displayed in (9); what are the runs down in that permutation? 12. [ZOO] Let V0, Ul, . . . , V,-l be n distinct numbers. Write an algorithm that determines the lengths of all ascending runs in the sequence. When your algorithm terminates, COUNT[r] should be the number of runs of length r, for 1 5 r 5 5, and COUNF[S] should be the number of rdns of length 6 or more. 13. [MZ?] Show that (16) is the number of permutations of p+q+l distinct elements having the pattern (15). b 14. [A&5] If we throw away the element that immediately follows a run, so that when Xj is greater than X3+1 we start the next run with X3+2, the run lengths are independent, and a simple chi-square test may be used (instead of the horribly compli- cated method derived in the text). What are the appropriate run-length probabilities for this simple run test? 15. [A4101 In the maximum-of-t test, why are Vi, V, , . . . , VAwl supposed to be uniformly distributed between zero and one? b 16. [IS] (a) Mr. J. H. Quick (a student) wanted to perform the maximum-of-t test for various values of t. Letting Zj, = max(Vj, U3+1,. . . , Uj+t–l), he found a clever way to go from the sequence ZO(~-~), Z1(t-l), , to the sequence Zot, Zlt, . . , using very little time and space. What was his bright idea? (b) He decided to modify the maximum-of-t method so that the jth observation would be max(u, , . . . , Uj+,-,); in other words, he took V, = Zj, instead of V, = Z ctjjt as the text says. He reasoned that all of the Z s should have the same distribution, so the test is even stronger if each ZJt, 0 5 j < n, is used instead of just every tth one. But when he tried a chi-square equidistribution test on the values of V$, he got extremely high values of the statistic V, which got even higher as t increased. Why did this happen? 17. (A4Z5] (a) Given any numbers UO, . . , Un-l, Vi,. . . , Vn-l, let 1 ai=- c uk, v=-; c vk. n O

3.3.2 EMPIRICAL TESTS 73 unpublished], is unlike the (Web site development)

Wednesday, July 25th, 2007

3.3.2 EMPIRICAL TESTS 73 unpublished], is unlike the others in that it was not developed before the advent of computers; it is specifically intended for computer use. The reader probably wonders, Why are there so many tests? It has been said that more computer time is spent testing random numbers than using them in applications! This is untrue, although it is possible to go overboard in testing. The need for making several tests has been amply documented. It has been recorded, for example, that some numbers generated by a variant of the middle- square method have passed the frequency test, gap test, and poker test, yet flunked the serial test. Linear congruential sequences with small multipliers have been known to pass many tests, yet fail on the run test because there are too few runs of length one. The maximum-of-t test has also been used to ferret out some bad generators that otherwise seemed to perform respectably. Perhaps the main reason for doing extensive testing on random number generators is that people misusing Mr. X s random number generator will hardly ever admit that their programs are at fault: they will blame the generator, until Mr. X can prove to them that his numbers are sufficiently random. On the other hand, if the source of random numbers is only for Mr. X s personal use, he might decide not to bother to test them, since the techniques recommended in this chapter have a high probability of being satisfactory. EXERCISES 1. [10] Why should the serial test described in part B be applied to (Yo, Yl), (Yz, Ys), . , (Ygnp2,Yzn–1) instead of to (Yo, Yl), (Yl, Yz), . . . , (Y,-,, Y,)? 2. [IO] State an appropriate way to generalize the serial test to triples, quadruples, etc., instead of pairs. b 3. [MX?] How many U s need to be examined in the gap test (Algorithm G) before n gaps have been found, on the average, assuming that the sequence is random? What is the standard deviation of this quantity? 4. [&?I Prove that the probabilities in (4) are correct for the gap test. 5. [Mz?] The classical gap test used by Kendall and Babington-Smith considers the numbers UO, U1, , UN-1 to be a cyclic sequence with UN+3 identified with U,. Here N is a fixed number of U s that are to be subjected to the test. If n of the numbers UO, , UN-~ fall into the range cr 5 U, < /3, there are n gaps in the cyclic sequence. Let 2, be the number of gaps of length r, for 0 5 r < t, and let Zt be the number of gaps of length 2 t; show that the quantity V = CoCVCt(Zr -np,) /np, should have the chi-square distribution with t degrees of freedom, in the limit as N goes to infinity, where p, is given in Eq. (4). 6. [40] (H. Geiringer.) A frequency count of the first 2000 decimal digits in the representation of e = 2.71828.. gave a x2 value of 1.06, indicating that the actual frequencies of the digits 0, 1, . , 9 are much too close to their expected values to be considered randomly distributed. (In fact, x2 > 1.15 with probability 99.9 percent.) The same test applied to the first 10,000 digits of e gives the reasonable value x2 = 8.61; but the fact that the first 2000 digits are so evenly distributed is still surprising. Does the same phenomenon occur in the representation of e to other bases? [See AMA4 72 (1965), 483-500.1

72 RANDOM NUMBERS 3.3.2 L. Historical remarks and (Business web site)

Tuesday, July 24th, 2007

72 RANDOM NUMBERS 3.3.2 L. Historical remarks and further discussion. Statistical tests arose naturally in the course of scientists efforts to prove or disprove hypotheses about various observed data. The best known early papers dealing with the testing of artificially generated numbers for randomness are two articles by M. G. Kendall and B. Babington-Smith in the Journd of the Royal Statisticd Society 101 (1938), 147-166, and in the supplement to that journal, 6 (1939), 51-61. These papers were concerned with the testing of random digits between 0 and 9, rather than random real numbers; for this purpose, the authors discussed the frequency test, serial test, gap test, and poker test, although they misapplied the serial test. Kendall and Babington-Smith also used a variant of the coupon collector s test; the method described in this section was introduced by R. E. Greenwood in Math. Comp. 9 (1955), l-4. The run test has a rather interesting history. Originally, tests were made on runs up and down at once: a run up would be followed by a run down, then another run up, and so on. Note that the run test and the permutation test do not depend on the uniform distribution of the U s, they depend only on the fact that Vi = Uj occurs with probability zero when i # j; therefore these tests can be applied to many types of random sequences. The run test in primitive form was originated by J. Bienayme [Comptes Rendus 81 (Paris: Acad. Sciences, 1875), 417-4231. S ome sixty years later, W. 0. Kermack and A. G. McKendrick published two extensive papers on the subject (Proc. Royal Society Edinburgh 57 (1937), 228-240, 332-3761; as an example they stated that Edinburgh rainfall between the years 1785 and 1930 was entirely random in character with respect to the run test (although they examined only the mean and standard deviation of the run lengths). Several other people began using the test, but it was not until 1944 that the use of the chi-square method in connection with this test was shown to be incorrect. The paper by H. Levene and J. Wolfowitz in Annals Math. Stat. 15 (1944), 58-69, introduced the correct run test (for runs up and down, alternately) and discussed the fallacies in earlier misuses of that test. Separate tests for runs up and runs down, as proposed in the text above, are more suited to computer application, so we have not given the more complex formulas for the alternate-up-and-down case. See the survey paper by D. E. Barton and C. L. Mallows, Annals Math. Stat. 36 (1965), 236-260. Of all the tests we have discussed, the frequency test and the serial correla- tion test seem to be the weakest, in the sense that nearly all random number generators pass these tests. Theoretical grounds for the weakness of these tests are discussed briefly in Section 3.5 (cf. exercise 3.5-26). The run test, on the other hand, is a rather strong test: the results of exercises 3.3.3-23 and 24 sug- gest that linear congruential generators tend to have runs somewhat longer than normal if the multiplier is not large enough, so the run test of exercise 14 is definitely to be recommended. The collision test is also highly recommended, since it has been especially designed to detect the deficiencies of many poor generators that have unfor- tunately become widespread. This test, which is based on ideas of H. Delgas Christiansen [Inst. Math. Stat. and Oper. Res., Tech. Univ. Denmark (Oct. 1975),

3.3.2 EMPIRICAL TESTS 71 coefficient is (Professional web hosting) not expected

Tuesday, July 24th, 2007

3.3.2 EMPIRICAL TESTS 71 coefficient is not expected to be exactly zero. (See exercise 18.) A good value of C will be between pn -2a, and pn + 2a,, where -1 1 n(n -3) Pn = -on=—n > 2. (25) n-l n-l J n+l We expect C to be between these limits about 95 percent of the time. Equations (25) are only conjectured at this time, since the exact distribution of C is not known when the U s are uniformly distributed. For the theory when the U s have the normal distribution, see the paper by Wilfrid J. Dixon, Annals Math. Stat. 15 (1944), 119-144. Empirical evidence suggests that we may safely use the formulas for the mean and standard deviation that have been derived from the assumption of the normal distribution, without much error; these are the values that have been listed in (25). It is known that lim,,, &(T, = 1; cf. the article by Anderson and Walker, Annals Math. Stat. 35 (1964), 1296-1303, where more general results about serial correlations of dependent sequences are derived. Instead of simply computing the correlation coefficient between the obser- vations ( UO, U1, . . . , Unpl) and their immediate successors (VI, . . . , Unpl, UO), we can also compute it between (UO, U1,. . , Unpl) and any cyclically shifted sequence (U,, . . . , Unpl, UO, . . . , U,-,); the cyclic correlations should be small for 0 < 4 < n. A straightforward computation of Eq. (24) for all 4 would require about n2 multiplications, but it is actually possible to compute all the correlations in only O(n log n) steps by using fast Fourier transforms. (See Section 4.6.4; cf. also L. P. Schmid, CACM 8 (1965), 115.) K. Tests on subsequences. It frequently happens that the external program using our random sequence will call for numbers in batches. For example, if the program works with three random variables X, Y, and 2, it may consistently invoke the generation of three random numbers at a time. In such applications it is important that the subsequences consisting of every third term of the original sequence be random. If the program requires 4 numbers at a time, the sequences uo, uq, u2q,. * * ; ~1,~q+1,~2q+1,…; . . . . uq-1,7Jzq-1, u3q-1,. . . can each be put through the tests described above for the original sequence UO, Ul, u2, . . . . Experience with linear congruential sequences has shown that these derived sequences rarely if ever behave less randomly than the original sequence, unless 4 has a large factor in common with the period length. On a binary computer with m equal to the word size, for example, a test of the subsequences for 4 = 8 will tend to give the poorest randomness for all 4 < 16; and on a decimal computer, g = 10 yields the subsequences most likely to be unsatisfactory. (This can be explained somewhat on the grounds of potency, since such values of 9 will tend to lower the potency.)

70 RANDOM NUMBERS 3.3.2 (Web design careers) Sl. [Initialize.] Set A[j]

Monday, July 23rd, 2007

70 RANDOM NUMBERS 3.3.2 Sl. [Initialize.] Set A[j] e 0 for 0 2 j 5 n; then set A[l] t 1 and j, e ji t 1. Then do step S2 exactly n - 1 times and go on to step S3. S2. [Update probabilities.] (Each time we do this step it corresponds to tossing a ball into an urn; A[j] represents the probability that exactly j of the urns are occupied.) Set jr +ji+l. Thenforjcjl,ji-l,…, jh(inthis order), set A[j] +- (j/mkUl + ((1 + l/m) -(jlm))Ab -11. If ALI has become very small as a result of this calculation, say A[j] < 10m2 , set A[j] + 0; and in such a case, if j = j, decrease ji by 1, or if j = jo increase jo by 1. S3. [Compute the answers.] In this step we make use of an auxiliary table (Tl I r2 f . . * , Ttmax ) = (.Ol, .05, .25, .50, .75, .95, .99, 1.00) containing the specified percentage points of interest. Set p t 0, t t 1, and j c j. -1. Do the following iteration until t = tmax: Increase j by 1, and set p e p + A[j]; then if p > Tt, output n -j -1 and 1 - p (meaning that with probability 1 -p there are at most n -j -1 collisions) and repeatedly increase t by 1 until p 5 Tt. I J. Serial correlation test. We may also compute the following statistic: n(U0Ul + GU2 + . . . + un-2un–l + un-1Uo) -(VII + Ul + . . . + un-1y = n(U$ + UT + .*a+ q-,)-(uo + u1+…+ un-ly . (23) This is the serial correlation coefficient, which is a measure of the amount lJ+l depends on Uj. Correlation coefficients appear frequently in statistics; if we have n quantities uo, Ul, f U,-I and n others Vi, VI, . . . , V,-I , the correlation coefficient between them is defined to be C= n m4w -cc WC vj) (24 J(nCu3 -Euj)2)(nCV~ -(,IEV,)2) All summations in this formula are to be taken over the range 0 5 j < n; Eq. (23) is the special case V, = U(j+l) modn. (Note: The denominator of (24) is zero when UO = VI = … = Un-l or VO = VI = … = V,-1; we exclude this case from discussion.) A correlation coefficient always lies between -1 and +l. When it is zero or very small, it indicates that the quantities Uj and I+ are (relatively speaking) independent of each other, but when the correlation coefficient is fl it indicates total linear dependence; in fact Vj = cx f PiIJj for all j in such a case, for some constants cr and p. (See exercise 17.) Therefore it is desirable to have C in Eq. (23) close to zero. In actual fact, since U&J1 is not completely independent of UIU2, the serial correlation

3.3.2 EMPIRICAL TESTS 69 Suppose (Web site design) we have m

Monday, July 23rd, 2007

3.3.2 EMPIRICAL TESTS 69 Suppose we have m urns and we throw n balls at random into those urns, where m is much greater than n. Most of the balls will land in urns that were previously empty, but if a ball falls into an urn that already contains at least one ball we say that a collision has occurred. The collision test counts the number of collisions, and a generator passes this test if it doesn t induce too many or too few collisions. To fix the ideas, suppose m = 220 and n = 214. Then each urn will receive only one 64th of a ball, on the average. The probability that a given urn will contain exactly Ic balls is pk = (z)mpk(l -m-l)n-k, so the expected number of collisions per urn is ,&>l(k-l)pk = xkBO k&-E,>, pk = n/m-l+po. Since po = (1 - m-l)n = 1 - n/m + (y)mm2 + small& terms, we find that the average total number of collisions taken over all m urns is very slightly less than n2/2m = 128. We can use the collision test to rate a random number generator in a large number of dimensions. For example, when m = 220 and n = 214 we can test the 20-dimensional randomness of a number generator by letting d = 2 and forming 20-dimensional vectors Vj = (Yzo~, Yzo~+~, . . . , Yzo~+~~) for 0 2 j < n. It suffices to keep a table of m = 22o bits to determine collisions, one bit for each possible value of the vector V,; on a computer with 32 bits per word, this amounts to 215 words. Initially all 220 bits of this table are cleared to zero; then for each V,, if the corresponding bit is already 1 we record a collision, otherwise we set the bit to 1. This test can also be used in 10 dimensions with d = 4, and so on. To decide if the test is passed, we can use the following table of percentage points when m = 220 and n = 214: collisions 5 101 108 119 126 134 145 153 with probability .009 .043 .244 .476 .742 .946 .989 The theory underlying these probabilities is the same we used in the poker test, Eq. (5); the probability that c collisions occur is the probability that n - c urns are occupied, namely m(m-l)…(m-n+c+l) n mn 1n-c I Although m and n are very large, it is not difficult to compute these probabilities using the following method: Algorithm S (Percentage points for collision test). Given m and n, this algorithm determines the distribution of the number of collisions that occur when n balls are scattered into m urns. An auxiliary array A[O], A[l], . . . , A[n] of floating point numbers is used for the computation; actually A[j] will be nonzero only for jo 2 j 2 j,, and ~ 1 - ~ 0 will be at most of order log n, so it would be possible to get by with considerably less storage.

68 RANDOM NUMBERS 3.3.2 Form the matrix C

Monday, July 23rd, 2007

68 RANDOM NUMBERS 3.3.2 Form the matrix C of the covariances of the R s; for example, Crs = covar(Rr , Rs), while CIt = covar(Rr , R:). When t = 6, we have C = nC1 + C2 =n + if n > 12. Now form A = (ai?), the inverse of the matrix C, and compute Cl

3.3.2 EMPIRICAL TESTS (Web hosting ecommerce) 67 are (~+q+l)~~q)-(Pll~~l)-(p+~+l)+l (16) ways

Sunday, July 22nd, 2007

3.3.2 EMPIRICAL TESTS 67 are (~+q+l)~~q)-(Pll~~l)-(p+~+l)+l (16) ways to arrange them in the order (15), as shown in exercise 13; and there are (n -p -q -l)! ways to arrange the remaining elements. Thus there are 1 1 times 16 ways in all, and dividing by n! we get the (p+:+&-P-q-1. ( 1 desired formula. From relations (14) a rather lengthy calculation leads to mean(R ,) = mean(Z,, + . . . + Z,,) = (n + l)Pl(P + 111 - (P - 1)/P!, l n, (18) where t = max(p, q), s = p + q, and ^, j(p, q, n) = (n + 1) s(l -pq) + pq 1Y -L)+(Y) (lg) ( (P + lY(q + (s + I)! , (s2 -s - 2)pq -s2 - p2q2 + 1 t. (P + lY(q + l)! . This expression for the covariance is unfortunately quite complicated, but it is necessary for a successful run test as described above. From these formulas it is easy to compute mean = mean(R ,) -mean(R ,+,), covar(Rp, Rb) = covar(RL, Rb) -covar(R ,+, , R ,), (20) covar(Rp, R4) = covar(R,, R ,) -covar(Rp, Rb+,). In Annals Math. Stat. 15 (1944), 163-165, J. Wolfowitz proved that the quantities Rl, R2, . . . , RtF1, R: become normally distributed as n –+ 00, subject to the mean and covariance expressed above; this implies that the following test for runs is valid: Given a sequence of n random numbers, compute the number of runs R, of length p for 1 2 p < t, and also the number of runs Ri of length t or more. Let &I = RI -mean( . . . . &t-l = Rt-1 -mean(Rt-I), &t = R$ -mean( (21)