# High performance CUDA uniform float random number generator within device function


I need a high-performance random number generator for Monte Carlo particle-transport calculations. The requirements are:

1. Independently generated by each thread

2. The period of the generator should be larger than 2^40

I have tried a Tausworthe generator (Mathematics of Computation 65(213), 1996, pp. 203-213), whose period is 2^88. The device function is as follows:

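```
__device__ float GenerateRNF(unsigned int *seed, int seedNum=3)
{
    // Load the three state words of the combined Tausworthe generator.
    unsigned int s1 = seed[0];
    unsigned int s2 = seed[1];
    unsigned int s3 = seed[2];
    unsigned int b = (((s1 << 13) ^ s1) >> 19);
    s1 = (((s1 & 0xFFFFFFFE) << 12) ^ b);
    b = (((s2 << 2) ^ s2) >> 25);
    s2 = (((s2 & 0xFFFFFFF8) << 4) ^ b);
    b = (((s3 << 3) ^ s3) >> 11);
    s3 = (((s3 & 0xFFFFFFF0) << 17) ^ b);
    // Store the advanced state so the next call continues the sequence.
    seed[0] = s1;
    seed[1] = s2;
    seed[2] = s3;
    // Scale the combined 32-bit word by 2^-32.
    return ((s1 ^ s2 ^ s3) * 2.32830644e-10f);
}
```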

The seed values should be held in registers, since this function is invoked hundreds of times by each thread. Each thread maintains its own seed sequence. If the block size is 64 and the thread number is 1024, the number of initial seeds generated on the host would be 1024*64*3.
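To make the seed bookkeeping concrete, here is a minimal host-side sketch of how the per-thread seed triples might be generated. The names `makeTausSeeds`, `drawAtLeast` and `seedsAreValid`, the use of `std::mt19937` as the host generator, and the interleaved layout (thread `t` owns elements `3*t` to `3*t + 2`) are all assumptions for illustration, not part of my actual code. The lower bounds reflect the seed constraints of the three-component Tausworthe generator (s1 > 1, s2 > 7, s3 > 15).

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draws a 32-bit word from the host generator that is at least `minimum`.
static std::uint32_t drawAtLeast(std::mt19937 &gen, std::uint32_t minimum)
{
    std::uint32_t v;
    do {
        v = static_cast<std::uint32_t>(gen());
    } while (v < minimum);
    return v;
}

// Builds three seed words per thread.
// Assumed layout: thread t owns seeds[3*t], seeds[3*t + 1], seeds[3*t + 2].
std::vector<std::uint32_t> makeTausSeeds(std::size_t threadCount,
                                         std::uint32_t masterSeed)
{
    std::mt19937 gen(masterSeed);
    std::vector<std::uint32_t> seeds(threadCount * 3);
    for (std::size_t t = 0; t < threadCount; ++t) {
        seeds[3 * t + 0] = drawAtLeast(gen, 2u);   // s1 must exceed 1
        seeds[3 * t + 1] = drawAtLeast(gen, 8u);   // s2 must exceed 7
        seeds[3 * t + 2] = drawAtLeast(gen, 16u);  // s3 must exceed 15
    }
    return seeds;
}

// Checks that every triple satisfies the seed constraints.
bool seedsAreValid(const std::vector<std::uint32_t> &seeds)
{
    if (seeds.size() % 3 != 0) return false;
    for (std::size_t i = 0; i < seeds.size(); i += 3) {
        if (seeds[i] < 2u || seeds[i + 1] < 8u || seeds[i + 2] < 16u) return false;
    }
    return true;
}
```

Calling `makeTausSeeds(1024 * 64, someMasterSeed)` returns the 1024*64*3 words mentioned above. The array would then be copied to the device (for example with `cudaMemcpy`), and each thread would load its three words into local variables once, draw its numbers, and write the state back at the end of the kernel.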

However, the above generator is still very slow. I have also tried the Park-Miller algorithm, which is several times faster than the Tausworthe generator (implemented with CUDA double-precision multiplication, tested on a GTX 1070).

However, its period (about 2^31) is much smaller than 2^40.
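For reference, a Park-Miller step built on a double-precision multiply might look roughly like the sketch below. It is an illustration under assumptions rather than my exact code: the name `parkMillerNext`, the `HOST_DEVICE` macro (which lets the same function compile with or without nvcc), and keeping the state in a `double` are all choices made here. It uses the classic multiplier 16807 and modulus 2^31 - 1.

```cpp
#include <math.h>

// Lets the same source build as CUDA (nvcc) or as plain host C++.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Advances a Park-Miller state x -> 16807 * x mod (2^31 - 1).
// The state is kept in a double and must start in [1, 2^31 - 2].
// Returns the new state scaled into (0, 1]; the largest states
// round up to 1.0f when converted to float.
HOST_DEVICE inline float parkMillerNext(double *state)
{
    const double a = 16807.0;
    const double m = 2147483647.0;  // 2^31 - 1
    // The product stays below 2^46, so it is exact in a double,
    // and fmod returns the exact remainder.
    *state = fmod(a * (*state), m);
    return static_cast<float>(*state / m);
}
```

Starting from a state of 1, the first two states are 16807 and 282475249. The state visits at most 2^31 - 2 distinct values before repeating, which is why this generator falls short of the 2^40 requirement.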

My question is: is there a random number generator whose period is longer than that of the Park-Miller algorithm, but which is several times faster than the Tausworthe generator? The function signature should follow the sample above. Thanks very much!

algorithm
performance
optimization
cuda
gpu