Statistical Tests

The random number generator in datagen produces 2.6E36 integers before the seed repeats itself. This is enough data to satisfy a large system test.

Seeds have two cycles. The seeds of the first cycle do not overlap the seeds of the second cycle. When you select an initial seed, you are selecting a starting point in the middle of one of these cycles.

The initialization routine hashes the seed to mask the pattern of the initial seed. All seeds are odd.

Datagen uses only the high order bit from each seed. The pattern of the low order bits in the seed does not affect the performance of datagen.

The tests below are described in:


Donald E. Knuth
The Art of Computer Programming
Volume 2, Chapter 3.3.1, General Test Procedures
Volume 2, Chapter 3.3.2, Empirical Tests


Lincoln L. Chao
Statistics for Management
Palo Alto, CA: The Scientific Press, 1984
Second Edition

The algorithms described in this book can be found in many other textbooks on inferential statistics.


CRC Standard Mathematical Tables
Cleveland, OH: The Chemical Rubber Company
1964-present.

This book contains density functions for various distributions.


Datagen consistently passes the tests listed below. If a test fails, rerun the test to be sure that the failure is not due to chance. Many of the tests compare their result against a chi-square table with a 95 percent probability of success. This means that, on average, 19 out of 20 runs of a test will pass the chi-square test.

If the results of the chi-square test hover at the high end of the range, it could mean that the data is especially chaotic. If the results hover at the low end of the range, it could mean that the data is too close to matching the population parameter. Ideally, the chi-square test should bounce around unpredictably within the range.

If a test fails, you have to determine whether the test is poorly designed or whether the data is particularly bad. One way to narrow this down is to throw some deliberate bias into the data. For instance, a normal distribution may be replaced with a trigonometric, logarithmic, or exponential distribution. With the biased data, the results should show clearly that the data is incorrect. If they do not, the test itself is suspect.
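The chi-square check described above can be sketched in a few lines of C. This is a minimal illustration, not the code used by test.sh. The function names chi_square and chi_verdict are hypothetical, and the caller supplies the low and high bounds from a chi-square table. Whether datagen's tests use a one-sided or a two-sided range is not stated here.

```c
#include <stddef.h>

/* Chi-square statistic: the sum over all categories of
   (observed - expected)^2 / expected. */
double chi_square(const long *observed, const double *expected, size_t k)
{
    double v = 0.0;
    size_t i;

    for (i = 0; i < k; i++) {
        double diff = (double)observed[i] - expected[i];
        v += diff * diff / expected[i];
    }
    return v;
}

/* Compare the statistic to a range taken from a chi-square table.
   Returns -1 if v is below the range (fit is suspiciously close),
   0 if v is inside the range, and 1 if v is above it (poor fit).
   The bounds are supplied by the caller; they are an assumption of
   this sketch, not values taken from datagen. */
int chi_verdict(double v, double low, double high)
{
    if (v < low)
        return -1;
    if (v > high)
        return 1;
    return 0;
}
```

As an example, take 1000 samples sorted into ten equally likely categories, so that each expected count is 100 and there are 9 degrees of freedom. The 2.5 and 97.5 percent points for 9 degrees of freedom in a standard chi-square table are about 2.700 and 19.02. The counts 90, 112, 97, 105, 88, 109, 101, 94, 110, 94 give a statistic of 6.76, which falls inside that range. The deliberately biased counts 130, 120, 110, 105, 100, 100, 95, 90, 80, 70 give 28.5, which falls above it.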

The coupon collector's test fails when the parameters are very close, such as:

      coupon 13 14 1000

I'm still investigating whether this is due to the parameters or due to the generator.
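One way to look at the parameters is to compute the expected count for each category. The sketch below assumes that the three arguments are d (the number of coupon types), t (the maximum sequence length), and n (the number of sequences), as in Knuth's description of the test. That reading of the command line is an assumption. The function name coupon_exact is hypothetical, and the code uses a direct dynamic-programming count rather than the Stirling number formula.

```c
#define MAXD 64   /* largest d this sketch accepts (arbitrary limit) */

/* Probability that a complete set of d coupon types is finished on
   exactly draw r, assuming independent, uniformly distributed draws.
   Returns 0.0 when r < d or when d is outside 1..MAXD. */
double coupon_exact(int d, int r)
{
    double seen[MAXD + 1] = {0.0};  /* seen[k] = P(k distinct types so far) */
    int i, k;

    if (d < 1 || d > MAXD || r < d)
        return 0.0;

    seen[0] = 1.0;                  /* before any draw, no types are seen */
    for (i = 1; i < r; i++) {
        /* Update from high k to low k so that seen[k - 1] still holds
           the value from the previous draw. */
        for (k = (i < d ? i : d); k >= 1; k--)
            seen[k] = seen[k] * k / d + seen[k - 1] * (d - k + 1) / d;
        seen[0] = 0.0;
    }
    /* After r - 1 draws exactly d - 1 types are seen, and draw r
       supplies the missing one with probability 1/d. */
    return seen[d - 1] / d;
}
```

Under that reading, coupon_exact(13, 13) is about 0.0000206, so 1000 sequences give an expected count of about 0.02 in the length 13 category. Knuth recommends an expected count of five or more in every category of a chi-square test, so parameters this close may be too tight for the sample size regardless of the generator.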

The topics below give more information about each test. The script test.sh performs all the numeric tests together. The two visual tests are run separately.

The test procedure, test.sh, serves two purposes:

Chi Square Test

Equidistribution Test

Continuous Function Test

Linear Regression Test

Serial Test

Gap Test

Poker Test

Coupon Collector's Test

Permutation Test

Runs Test

Maximum of T Test

Collision Test

Serial Correlation Test

Subsequence Test

Subsequence Test, Power of 2

Independence Test, Chi-square Test

Dependent Probability Test, Chi-square Test

Coin Flip, Chi-square Test

Roll 2 to 8 Dice, Chi-square Test

Binomial Distribution, Chi-square Test

Binomial Distribution, Chi-square Test
    Based on error of estimation.

Poisson Distribution, Chi-square Test

Poisson Distribution, Chi-square Test
    Based on error of estimation.

Sample Mean Distribution, Chi-square Test

Difference Between Two Mean Proportions

Difference Between Two Sample Proportions

Powerball Lottery, Chi-square Test

52 Card Poker, Chi-square Test

5 Dice Poker, Chi-square Test

Bingo, Chi-square Test

Roulette, Chi-square Test

Slot Machine, Chi-square Test


Datagen Versus Other Generators

The following programs compare the datagen random number generator to other widely accepted random number generators.

Generator    Test
Ran2         Mean Difference
Drand48      Mean Difference
Ran2         Binomial Distribution
Drand48      Binomial Distribution
Ran2         Chi-square Distribution
Drand48      Chi-square Distribution


Visual Tests

The two visual tests below were designed independently of Knuth's Spectral Test in Chapter 3.3.4.

Visual Test in Curses

Visual Test in X Windows


Common Routines

Stirling Numbers

stirl2.c generates Stirling numbers of the second kind using a recursive algorithm.
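The standard recurrence for Stirling numbers of the second kind is S(n, k) = k * S(n-1, k) + S(n-1, k-1). The sketch below shows that recurrence in C. It is an illustration only and is not taken from stirl2.c. The function name stirling2 is hypothetical.

```c
/* Stirling number of the second kind, S(n, k): the number of ways to
   partition n items into k non-empty subsets. Plain recursion, so it
   is only practical for small n. */
unsigned long long stirling2(int n, int k)
{
    if (n < 0 || k < 0 || k > n)
        return 0;
    if (n == k)
        return 1;               /* includes S(0, 0) = 1 */
    if (k == 0)
        return 0;               /* n > 0 here: no empty partition */
    return (unsigned long long)k * stirling2(n - 1, k)
         + stirling2(n - 1, k - 1);
}
```

For example, stirling2(5, 2) is 15 and stirling2(6, 3) is 90. The plain recursion recomputes the same values many times, so a table of previously computed values is the usual choice when n is large.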


Acceptable Ranges

chirange.c calculates the acceptable range of a chi-square test.

ksrange.c calculates the acceptable range of a Kolmogorov-Smirnov test.


Sample Size Estimation

getn.c calculates the sample size based on:

The programs that call getn.c are:

Subroutine    Calling Program    Type of Test
getn.c        bnmerr             Binomial
getn.c        bnmsz              Binomial
getn.c        poierr             Poisson
getn.c        poisz              Poisson
getn.c        meandfsz           Mean Difference

getsz.c calculates the sample size based on:

The programs that call getsz.c are:

Subroutine    Calling Program    Type of Test
getsz.c       flipsz             Coin Flip
getsz.c       dicesz             Rolling 2 to 8 Dice
getsz.c       smplsz             Sample Mean

