Preamble
This question is not about the behavior of (P)RNG and rand()
. It's about using power of two values uniformly distributed against modulo.
Introduction
I knew that one should not use modulo %
to convert a value from a range to another, for example to get a value between 0 and 5 from the rand()
function: there will be a bias. It's explained here https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default and in this answer Why do people say there is modulo bias when using a random number generator?
But today after investigating some code which was looking wrong, I've made a tool to demonstrate the behavor of modulo: https://gitorious.org/modulo-test/modulo-test/trees/master and found that's not clear enough.
A dice is only 3 bits
I checked with 6 values in range 0..5. Only 3 bits are needed to code those values.
$ ./modulo-test 10000 6 3
interations = 10000, range = 6, bits = 3 (0x00000007)
[0..7] => [0..5]
theorical occurences 1666.67 probability 0.16666667
[ 0] occurences 2446 probability 0.24460000 ( +46.76%)
[ 1] occurences 2535 probability 0.25350000 ( +52.10%)
[ 2] occurences 1275 probability 0.12750000 ( -23.50%)
[ 3] occurences 1297 probability 0.12970000 ( -22.18%)
[ 4] occurences 1216 probability 0.12160000 ( -27.04%)
[ 5] occurences 1231 probability 0.12310000 ( -26.14%)
minimum occurences 1216.00 probability 0.12160000 ( -27.04%)
maximum occurences 2535.00 probability 0.25350000 ( +52.10%)
mean occurences 1666.67 probability 0.16666667 ( +0.00%)
stddev occurences 639.43 probability 0.06394256 ( 38.37%)
With 3 bits of input, the results are indeed awful, but behave as expected. See answer https://stackoverflow.com/a/14614899/611560
Increasing the number of input bits
What puzzled me, was increasing the number of input bits made the results different. You should not forgot to increase the number of iterations, eg the number of sample otherwise the results are likely wrong (see Wrong Statistics).
Lets try with 4 bits:
$ ./modulo-test 20000 6 4
interations = 20000, range = 6, bits = 4 (0x0000000f)
[0..15] => [0..5]
theorical occurences 3333.33 probability 0.16666667
[ 0] occurences 3728 probability 0.18640000 ( +11.84%)
[ 1] occurences 3763 probability 0.18815000 ( +12.89%)
[ 2] occurences 3675 probability 0.18375000 ( +10.25%)
[ 3] occurences 3721 probability 0.18605000 ( +11.63%)
[ 4] occurences 2573 probability 0.12865000 ( -22.81%)
[ 5] occurences 2540 probability 0.12700000 ( -23.80%)
minimum occurences 2540.00 probability 0.12700000 ( -23.80%)
maximum occurences 3763.00 probability 0.18815000 ( +12.89%)
mean occurences 3333.33 probability 0.16666667 ( +0.00%)
stddev occurences 602.48 probability 0.03012376 ( 18.07%)
Lets try with 5 bits:
$ ./modulo-test 40000 6 5
interations = 40000, range = 6, bits = 5 (0x0000001f)
[0..31] => [0..5]
theorical occurences 6666.67 probability 0.16666667
[ 0] occurences 7462 probability 0.18655000 ( +11.93%)
[ 1] occurences 7444 probability 0.18610000 ( +11.66%)
[ 2] occurences 6318 probability 0.15795000 ( -5.23%)
[ 3] occurences 6265 probability 0.15662500 ( -6.03%)
[ 4] occurences 6334 probability 0.15835000 ( -4.99%)
[ 5] occurences 6177 probability 0.15442500 ( -7.34%)
minimum occurences 6177.00 probability 0.15442500 ( -7.34%)
maximum occurences 7462.00 probability 0.18655000 ( +11.93%)
mean occurences 6666.67 probability 0.16666667 ( +0.00%)
stddev occurences 611.58 probability 0.01528949 ( 9.17%)
Lets try with 6 bits:
$ ./modulo-test 80000 6 6
interations = 80000, range = 6, bits = 6 (0x0000003f)
[0..63] => [0..5]
theorical occurences 13333.33 probability 0.16666667
[ 0] occurences 13741 probability 0.17176250 ( +3.06%)
[ 1] occurences 13610 probability 0.17012500 ( +2.08%)
[ 2] occurences 13890 probability 0.17362500 ( +4.18%)
[ 3] occurences 13702 probability 0.17127500 ( +2.77%)
[ 4] occurences 12492 probability 0.15615000 ( -6.31%)
[ 5] occurences 12565 probability 0.15706250 ( -5.76%)
minimum occurences 12492.00 probability 0.15615000 ( -6.31%)
maximum occurences 13890.00 probability 0.17362500 ( +4.18%)
mean occurences 13333.33 probability 0.16666667 ( +0.00%)
stddev occurences 630.35 probability 0.00787938 ( 4.73%)
Question
Please explain me why the results are different when changing the input bits (and increasing the sample count accordingly) ? What is the mathematical reasoning behind these ?
Wrong statistics
In the previous version of the question, I showed a test with 32bits of input and only 1000000 iterations, eg 10^6 samples, and said I was surprised to get correct results. It was so wrong I'm ashamed of: there must be N times more samples to have confidence to get all 2^32 values of the generator. Here 10^6 is way to small compaired to 2^32. Bonus for people able to explain this in mathematical/statistical language..
Here the wrong results:
$ ./modulo-test 1000000 6 32
interations = 1000000, range = 6, bits = 32 (0xffffffff)
[0..4294967295] => [0..5]
theorical occurences 166666.67 probability 0.16666667
[ 0] occurences 166881 probability 0.16688100 ( +0.13%)
[ 1] occurences 166881 probability 0.16688100 ( +0.13%)
[ 2] occurences 166487 probability 0.16648700 ( -0.11%)
[ 3] occurences 166484 probability 0.16648400 ( -0.11%)
[ 4] occurences 166750 probability 0.16675000 ( +0.05%)
[ 5] occurences 166517 probability 0.16651700 ( -0.09%)
minimum occurences 166484.00 probability 0.16648400 ( -0.11%)
maximum occurences 166881.00 probability 0.16688100 ( +0.13%)
mean occurences 166666.67 probability 0.16666667 ( +0.00%)
stddev occurences 193.32 probability 0.00019332 ( 0.12%)
I still have to read and re-read the excellent article of Zed Shaw "Programmers Need To Learn Statistics Or I Will Kill Them All".
In essence, you're doing:
(rand() & 7) % 6
Let's assume that rand()
is uniformly distributed on [0; RAND_MAX]
, and that RAND_MAX+1
is a power of two. It is clear that rand() & 7
can evaluate to 0
, 1
, ..., 7
, and that the outcomes are equiprobable.
Now let's look at what happens when you take the result modulo 6
.
This explains why you get twice as many zeroes and ones as you get the other numbers.
The same thing is happening in the second case. However, the value of the "extra" numbers is much smaller, making their contribution indistinguishable from noise.
To summarize, if you have an integer uniformly distributed on [0
; M-1
], and you take it modulo N
, the result will be biased towards zero unless M
is divisible by N
.
rand()
(or some other PRNG) produces values in the interval [0 .. RAND_MAX]
. You want to map these to the interval [0 .. N-1]
using the remainder operator.
Write
(RAND_MAX+1) = q*N + r
with 0 <= r < N
.
Then for each value in the interval [0 .. N-1]
there are
q+1
values of rand()
that are mapped to that value if the value is smaller than r
q
values of rand()
that are mapped to it if the value is >= r
.Now, if q
is small, the relative difference between q
and q+1
is large, but if q
is large - 2^32 / 6
, for example - the difference cannot easily be measured.
User contributions licensed under CC BY-SA 3.0