The random number generator in datagen produces 2.6E36 integers before the seed repeats itself. This is enough data to satisfy a large system test.
Seeds have two cycles. The seeds of the first cycle do not overlap the seeds of the second cycle. When you select an initial seed, you are selecting a starting point in the middle of one of these cycles.
The initialization routine hashes the seed to mask the pattern of the initial seed. All seeds are odd.
Datagen uses only the high order bit from each seed. The pattern of the low order bits in the seed does not affect the performance of datagen.
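The tests below are described in:
Donald E. Knuth
The Art of Computer Programming
Volume 2, Chapter 3.3.1, General Test Procedures
Volume 2, Chapter 3.3.2, Empirical Tests
Lincoln L. Chao
Statistics for Management
Second Edition
Palo Alto, CA: The Scientific Press, 1984
The algorithms described in this book can be found in many other textbooks on inferential statistics.
CRC Standard Mathematical Tables
Cleveland, OH: The Chemical Rubber Company, 1964-present.
This book contains density functions for various distributions.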
Datagen consistently passes the tests
listed below.
If a test fails, rerun the test to be sure
that the failure is not due to chance.
Many of the tests compare their result against a chi-square
table at the 95 percent level.
This means that, on average, 19 out of 20 runs
will pass the chi-square test and about 1 in 20 will fail by chance alone.
If the results of the chi-square test hover at the
high end of the range, it could mean that the data
is especially chaotic. If the results hover at the
low end of the range, it could mean that the data is
too close to matching the population parameter. Ideally,
the chi-square test should bounce around unpredictably
within the range.
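The comparison described above can be sketched in C as follows. This is a minimal illustration, not datagen's own code: the function names `chi_square` and `chi_range_check` are hypothetical, and the caller is assumed to supply the low and high bounds from a chi-square table (or from a program such as chirange.c, whose exact method is not shown here).

```c
#include <assert.h>
#include <stddef.h>

/* Pearson chi-square statistic:
 * the sum over all categories of (observed - expected)^2 / expected. */
double chi_square(const long *observed, const double *expected, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(expected[i] > 0.0);   /* every category needs a positive expectation */
        double diff = (double)observed[i] - expected[i];
        sum += diff * diff / expected[i];
    }
    return sum;
}

/* Classify a statistic against an acceptable range taken from a table.
 * Returns -1 if it is below the range (data fits suspiciously well),
 *          0 if it is inside the range,
 *         +1 if it is above the range (data fits poorly). */
int chi_range_check(double stat, double low, double high)
{
    if (stat < low)
        return -1;
    if (stat > high)
        return 1;
    return 0;
}
```

For example, 600 rolls of one die with observed counts 95, 105, 98, 102, 110, 90 and an expected count of 100 per face give a statistic of 2.58. With 5 degrees of freedom, the standard table values 0.831 (lower 2.5 percent point) and 12.833 (upper 2.5 percent point) bracket a two-sided 95 percent range, so this result falls inside the range.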
When a test fails, you must determine whether
the test is poorly designed or whether the data is genuinely
bad. One way to narrow this down is to introduce deliberate bias
into the data. For instance, a normal distribution may
be replaced with a trigonometric, logarithmic, or
exponential distribution. When this is done, a well-designed
test shows clearly that the data is incorrect.
The coupon collector's test fails when the
parameters are very close, such as:
coupon 13 14 1000
I'm still investigating whether this is due to
the parameters or due to the generator.
Click on the topics below to get more information
about each test. The script test.sh
performs all the numeric tests together.
The two visual tests are run separately.
The test procedure, test.sh, serves two
purposes:
Independence Test, Chi-square Test
Dependent Probability Test, Chi-square Test
Roll 2 to 8 Dice, Chi-square Test
Binomial Distribution, Chi-square Test
Binomial Distribution, Chi-square Test
Poisson Distribution, Chi-square Test
Poisson Distribution, Chi-square Test
Sample Mean Distribution, Chi-square Test
Difference Between Two Mean Proportions
Difference Between Two Sample Proportions
Powerball Lottery, Chi-square Test
52 Card Poker, Chi-square Test
The following programs compare the datagen
random number generator to other widely accepted
random number generators.
The Art of Computer Programming
Volume 2, Chapter 3.3.1, General Test Procedures
Volume 2, Chapter 3.3.2, Empirical Tests
Statistics for Management
Palo Alto, CA: The Scientific Press, 1984
Second Edition
Cleveland, OH: The Chemical Rubber Company
1964-present.
Based on error of estimation.
Based on error of estimation.
Datagen Versus Other Generators
Generator | Test |
---|---|
Ran2 | Mean Difference |
Drand48 | Mean Difference |
Ran2 | Binomial Distribution |
Drand48 | Binomial Distribution |
Ran2 | Chi-square Distribution |
Drand48 | Chi-square Distribution |
The two visual tests below were designed
independently of Knuth's
Spectral Test in Chapter 3.3.4.
stirl2.c generates Stirling numbers
of the second kind using a recursive algorithm.
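A recursive computation of Stirling numbers of the second kind can be sketched as below. This is an illustration of the standard recurrence S(n, k) = k * S(n-1, k) + S(n-1, k-1), not the contents of stirl2.c; the function name `stirling2` and the limit on n are assumptions made for this example.

```c
#include <assert.h>

/* S(n, k): the number of ways to partition n distinct items
 * into k non-empty subsets, computed by plain recursion. */
unsigned long long stirling2(unsigned n, unsigned k)
{
    /* Assumed guard: every S(25, k) fits in 64 bits, and the
     * unmemoized recursion stays fast at this size. */
    assert(n <= 25);

    if (n == k)
        return 1;                 /* S(n, n) = 1, including S(0, 0) */
    if (k == 0 || k > n)
        return 0;                 /* no valid partition exists */

    /* Item n either joins one of the k existing subsets
     * or forms a new subset by itself. */
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1);
}
```

For instance, `stirling2(5, 2)` returns 15 and `stirling2(6, 3)` returns 90. A table-driven version would avoid recomputing subproblems, but the plain recursion mirrors the recurrence most directly.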
chirange.c calculates the acceptable
range of a chi-square test.
ksrange.c calculates the acceptable
range of a Kolmogorov-Smirnov test.
getn.c calculates the sample size
based on:
The programs that call getn.c are:
Visual Tests
Common Routines
Stirling Numbers
Acceptable Ranges
Sample Size Estimation
Subroutine | Calling Program | Type of Test |
---|---|---|
getn.c | bnmerr | Binomial |
getn.c | bnmsz | Binomial |
getn.c | poierr | Poisson |
getn.c | poisz | Poisson |
getn.c | meandfsz | Mean Difference |
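The criteria used by getn.c are not listed here. As a hedged illustration of sizing a sample from an error of estimation, the sketch below uses the textbook formula for a binomial proportion, n = z^2 * p * (1 - p) / e^2, rounded up. The function name `sample_size_proportion` and its parameters are hypothetical, and getn.c may compute its result differently.

```c
#include <assert.h>

/* Sample size needed to estimate a proportion p to within +/- e,
 * where z is the normal deviate for the chosen confidence level
 * (for example, 1.96 for 95 percent).
 * Textbook formula, assumed for illustration only. */
long sample_size_proportion(double z, double p, double e)
{
    assert(z > 0.0 && e > 0.0);
    assert(p > 0.0 && p < 1.0);

    double x = z * z * p * (1.0 - p) / (e * e);

    /* Round up to the next whole observation without needing libm. */
    long n = (long)x;
    if ((double)n < x)
        n++;
    return n;
}
```

With z = 1.96, p = 0.5, and e = 0.05, the formula gives 384.16, so the function returns 385. Using p = 0.5 is the conservative choice, because p * (1 - p) is largest there.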
getsz.c calculates the sample size based on:
The programs that call getsz.c are:
Subroutine | Calling Program | Type of Test |
---|---|---|
getsz.c | flipsz | Coin Flip |
getsz.c | dicesz | Rolling 2 to 8 Dice |
getsz.c | smplsz | Sample Mean |