I'm writing a CUDA implementation to generate "magic bitboards" for a chess engine. I have a working CUDA version, but I'm attempting to optimize it. In the end I have two arrays, 'questions' and 'answers', and I'm trying to generate a magic number to multiply by any particular question, and get a hashed index. I then use that index to look up the value in a table. The goal is to generate the indexes 0-[n], or to generate a 'perfect hashing' function.
However, I believe the main problem is in accessing the memory. The following section of code does return a magic number, but it is incorrect.
The broken section of code looks like this:
int index = threadIdx.x;
__shared__ int magic_good;
__shared__ u64_t magic_number;
// only try 1000 times, as many times this algorithm can't ever succeed
// more iterations are done at a higher level
for (int tries = 0; tries < 1000; tries++) {
    resulting_magic_moves[index] = 0xffffffffffffffff;
    magic_good = 1;
    if (index == 0) { // only generate one random number
        magic_number = random(); // using KISS as a random number generator, actually
    }
    __syncthreads();
   int magic_index = (question[index] * magic_number) >> (64 - questions_bits);
   // this 'magic_index' is basically just a random number at this point
   // as a result, I need some sort of locking on this array...
   resulting_magic_moves[magic_index] = answers[index]; // this could be set by multiple threads
   // this is my attempt to deal with the locking
   __syncthreads();
   if (resulting_magic_moves[magic_index] != answers[index]) {
        magic_good = 0;
    }
    __syncthreads();
    if (magic_good) // set to one at the beginning
        break; // it worked!  Or did it?
}
if (magic_good && index == 0)
    *in_magic_number = magic_number;
I believe that the array resulting_magic_moves[magic_index] is getting set in every thread, and the compiler is remembering that it put 'answers[index]' into that spot, and no other section would have modified it. I did declare the pointer as 'volitile' as a test, but that didn't work.
I'm using a gridDim of 1, and a blockDim equal to the number of questions and answers. My calling code looks like:
magic_brute_force<<<1, questions>>>(d_magic_number, d_magic_moves_table, questions_bits, d_questions, d_answers, usec (a seed));
where
questions == 1<<questions_bits.  (The nature of the algorithm forces a power of 2 questions and answers).
Any help on how to properly lock this would be greatly appreciated. I have a different version where each thread generates its own number and checks every index itself, but this doesn't seem to be the CUDA way.
EDIT: The above was a simplified version, the full code is posted below, with the questions and answers being read from a file, and with a functional CUDA kernel commented out:
#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include "fileio.c" // compile it into one file... kinda ugly, but whatever...
typedef unsigned long long bb_t;
typedef struct {
    bb_t magic_number;
    int bits;
} magic_t;
// random number generator stolen from somewhere...
#define znew(m) (m[0]=36969*(m[0]&65535)+(m[0]>>16)) // z
#define wnew(m) (m[1]=18000*(m[1]&65535)+(m[1]>>16)) // w
#define MWC(m) ((znew(m)<<16)+wnew(m)) // znew, wnew
#define SHR3(m) (m[2]^=(m[2]<<17), m[2]^=(m[2]>>13), m[2]^=(m[2]<<5)) // jsr
#define CONG(m) (m[3]=69069*m[3]+1234567) //jcong
#define KISS(m) ((MWC(m)^CONG(m))+SHR3(m)) // MWC, CONG, SHR3
#define SETUP_KISS(m, seed) m[0] = seed; m[1] = (seed ^ 0xdeadbeef); m[2] = m[1] ^ 0x12345678; m[3] = m[1] + m[2]
#define LONG_KISS(m) ((((bb_t)KISS(m))<<32)|((bb_t)KISS(m)))
#define LONG_KISS_FEWBITS(m) LONG_KISS(m) & LONG_KISS(m) & LONG_KISS(m)
typedef unsigned long UL;
#define magic_perform(magic, occ) (((magic).magic_number * (occ)) >> (64 - (magic).bits))
const bb_t bitboard_universe = 0xffffffffffffffff;
#define TRIES 100
__global__ void magic_brute_force(magic_t *magic_answer, volatile bb_t *resulting_magic_moves,
                              const int questions_bits, const int desired_bits,
                              const bb_t *question, const bb_t *answer, int seed) {
/*    int desired_size = 1<<desired_bits;
    int index = threadIdx.x;
    int questions = 1<<questions_bits;
    UL random_state1[4];
    SETUP_KISS(random_state1, ((index+1)*seed));
    magic_t my_magic;
    my_magic.bits = desired_bits;
    bb_t too_small_table[512];
    int tries = TRIES;
    while (magic_answer->bits == 0 && tries--) {
    my_magic.magic_number = LONG_KISS_FEWBITS(random_state1);
    for (int a = 0; a < desired_size; a++) {
        too_small_table[a] = bitboard_universe;
    }
    int q;
    for (q = 0; q < questions; q++) {
        int index = magic_perform(my_magic, question[q]);
        if (too_small_table[index] == bitboard_universe) {
            too_small_table[index] = answer[q];
        } else if (answer[q] != too_small_table[index]) {
            break;
        }
    }
    if (q == questions) {
        *magic_answer = my_magic;
    }
    __syncthreads();
    }
    if (magic_answer->magic_number == my_magic.magic_number) {
    for (int a = 0; a < desired_size; a++) {
        resulting_magic_moves[a] = too_small_table[a];
    }
    }
*/
    int index = threadIdx.x;
    int desired_size = 1<<desired_bits;
    __shared__ int magic_good;
    magic_good = 0;
    UL rs[4];
    SETUP_KISS(rs, seed);
    __shared__ magic_t tmp_magic;
    __syncthreads();
    for (int tries = 0; tries < TRIES; tries++) {
    if (index < desired_size)
        resulting_magic_moves[index] = bitboard_universe;
    magic_good = 1;
    if (index == 0) {
        tmp_magic.magic_number = LONG_KISS_FEWBITS(rs);
        tmp_magic.bits = desired_bits;
    }
    __syncthreads();
    int magic_index = magic_perform(tmp_magic, question[index]);
    resulting_magic_moves[magic_index] = answer[index];
    __syncthreads();
    if (resulting_magic_moves[magic_index] != answer[index])
        magic_good = 0;
    if (index < desired_size)
        resulting_magic_moves[index] = bitboard_universe;
    __syncthreads();
    if (magic_good) {
        if (index == 0)
            *magic_answer = tmp_magic;
        break;
    }
    }
}
/**
 * Host main routine
 */
int main(int argc, char **argv)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;
    if (argc != 3) {
    printf("Usage %s filename answers_bits\n", argv[0]);
    exit(1);
    }
    FILE *file = fopen(argv[1], "r");
    if (!file) {
    printf("Couldn't open %s\n", argv[1]);
    fclose(file);
    exit(1);
    }
    int questions_bits;
    full_read(file, &questions_bits, sizeof(int));
    int answers_bits = atoi(argv[2]);
    size_t questions_size = 1<<questions_bits;
    size_t answers_size = 1<<answers_bits;
    printf("Magic number generation decrease by %d orders of magnitude\n", questions_bits - answers_bits);
    bb_t h_questions[questions_size];
    bb_t h_answers[questions_size];
    magic_t h_magic_answer;
    bb_t h_magic_moves[answers_size];
    printf("questions size: %i\nquestions bits: %i\nanswers size: %i\nannswers_bits: %i\n",
        questions_size, questions_bits, answers_size, answers_bits);
    // this simplifies the code below a little bit...
    questions_size *= sizeof(bb_t);
    answers_size *= sizeof(bb_t);
    if (full_read(file, h_questions, questions_size) ||
    full_read(file, h_answers, questions_size)) {
    printf("Couldn't read from %s\n", argv[1]);
    exit(1);
    }
    fclose(file);
    // Allocate the device memory
    bb_t *d_questions, *d_answers, *d_magic_moves;
    magic_t *d_magic_answer;
    UL *d_random_state;
    int mem_error;
    // this relies on short circuiting
    mem_error = ((err = cudaMalloc(&d_questions, questions_size)) != cudaSuccess) ||
            ((err = cudaMalloc(&d_answers, questions_size)) != cudaSuccess) ||
            ((err = cudaMalloc(&d_magic_moves, answers_size)) != cudaSuccess) ||
            ((err = cudaMalloc(&d_magic_answer, sizeof(magic_t))) != cudaSuccess) ||
            ((err = cudaMalloc(&d_random_state, 4*sizeof(UL))) != cudaSuccess);
    if (mem_error)
    {
    fprintf(stderr, "Failed to allocate device vector (error code %s)!\n", cudaGetErrorString(err));
    return -1;
    }
    printf("Copy input data from the host memory to the CUDA device\n");
    mem_error = ((err = cudaMemcpy(d_questions, h_questions, questions_size, cudaMemcpyHostToDevice)) != cudaSuccess) ||
            ((err = cudaMemcpy(d_answers, h_answers, questions_size, cudaMemcpyHostToDevice)) != cudaSuccess) ||
            ((err = cudaMemset(d_magic_answer, 0, sizeof(magic_t))) != cudaSuccess) ||
            ((err = cudaMemset(d_random_state, 0xff, sizeof(UL)*4)) != cudaSuccess);
    if (mem_error)
    {
    fprintf(stderr, "Failed to copy vector from host to device (error code %s)!\n", cudaGetErrorString(err));
    return -1;
    }
    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = questions_size/sizeof(bb_t);
    int blocksPerGrid = 1; // (questions_size + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    struct timeval start, end;
    bb_t usec = 0;
    struct timezone tz;
    h_magic_answer.bits = 0;
    bb_t total_iterations = 0;
    gettimeofday(&start, &tz);
    while (h_magic_answer.bits == 0) {
    magic_brute_force<<<blocksPerGrid, threadsPerBlock>>>(d_magic_answer, d_magic_moves, questions_bits, answers_bits, d_questions, d_answers, usec);
    err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch brute force kernel (error code %s)!\n", cudaGetErrorString(err));
        return -1;
    }
    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    mem_error = ((err = cudaMemcpy(&h_magic_answer, d_magic_answer, sizeof(magic_t), cudaMemcpyDeviceToHost)) != cudaSuccess);
    if (mem_error)
    {
        fprintf(stderr, "Failed to copy magic_answer from device to host (error code %s)!\n", cudaGetErrorString(err));
        return -1;
    }
    gettimeofday(&end, &tz);
    usec = (((bb_t)end.tv_sec*1000000)+end.tv_usec)-(((bb_t)start.tv_sec)*1000000+start.tv_usec);
    int iterations =  TRIES*threadsPerBlock*blocksPerGrid;
    total_iterations += iterations;
    // as it turns out it/usec == million it/sec, unit wise...
    float million_it_per_sec = ((float)total_iterations)/usec;
    if (total_iterations / 1000000 != (total_iterations - iterations) / 1000000) {
        printf("Ran in %lf million iterations in %lld.%03lld seconds at %f million it/sec\n", ((double)total_iterations)/1000000, usec / 1000000, usec % 1000, million_it_per_sec);
    }
    }
    mem_error = ((err = cudaMemcpy(h_magic_moves, d_magic_moves, answers_size, cudaMemcpyDeviceToHost)) != cudaSuccess);
    if (mem_error)
    {
    fprintf(stderr, "Failed to copy magic_moves from device to host (error code %s)!\n", cudaGetErrorString(err));
    return -1;
    }
    printf("Testing magic numbers\n");
    questions_size /= sizeof(bb_t);
    int i;
    printf("Magic Number: %i\n", h_magic_answer.magic_number);
    printf("Magic Bits: %i\n", h_magic_answer.bits);
    for (i = 0; i < questions_size; i++) {
    if (h_answers[i] != h_magic_moves[magic_perform(h_magic_answer, h_questions[i])])
        break;
    }
    if (i == questions_size)
    printf("Test PASSED\n");
    else
    printf("Test FAILED!\n");
    // Free device global memory
    mem_error = ((err = cudaFree(d_random_state)) != cudaSuccess) ||
            ((err = cudaFree(d_magic_answer)) != cudaSuccess) ||
            ((err = cudaFree(d_magic_moves)) != cudaSuccess) ||
            ((err = cudaFree(d_questions)) != cudaSuccess) ||
            ((err = cudaFree(d_answers)) != cudaSuccess);
    if (mem_error)
    {
    fprintf(stderr, "Failed to free device memory (error code %s)!\n", cudaGetErrorString(err));
    return -1;
    }
    // Reset the device and exit
    // cudaDeviceReset causes the driver to clean up all state. While
    // not mandatory in normal operation, it is good practice.  It is also
    // needed to ensure correct operation when the application is being
    // profiled. Calling cudaDeviceReset causes all profile data to be
    // flushed before the application exits
    err = cudaDeviceReset();
    if (err != cudaSuccess)
    {
    fprintf(stderr, "Failed to deinitialize the device! error=%s\n", cudaGetErrorString(err));
    return -1;
    }
    printf("Done\n");
    return 0;
}
Double Edit: cuda-memcheck --tool racecheck returns the following output:
========= ERROR: Potential RAW hazard detected at __shared__ 0x3 in block (0, 0, 0) :
=========     Write Thread (145, 0, 0) at 0x00000158 in magic_brute_force(magic_t*, __int64 volatile *, int, int, __in
t64 const *, __int64 const *, int)
=========     Read Thread (330, 0, 0) at 0x00000a88 in magic_brute_force(magic_t*, __int64 volatile *, int, int, __int
64 const *, __int64 const *, int)
=========     Current Value : 0
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x331) [0x138371]
=========     Host Frame:./magic [0x1aea8]
=========     Host Frame:./magic [0x3b7e3]
=========     Host Frame:./magic [0x35e3]
=========     Host Frame:./magic [0x344a]
=========     Host Frame:./magic [0x3490]
=========     Host Frame:./magic [0x2ddd]
=========     Host Frame:/usr/lib/libc.so.6 (__libc_start_main + 0xf0) [0x20000]
=========     Host Frame:./magic [0x2599]
And it prints it out a lot.
Triple Edit:
I fixed my own problem. If you look at the second section of code, you'll note that I incorrectly write to the resulting_magic_moves array at the end of the function, with:
resulting_magic_moves[index] = bitboard_universe;
thus destroying my results. Thanks for all the input!
If you simply let that be set at the beginning, and not set it at the end, it works fine (despite the memory access problems reported by race).
However, the first kernel does perform better, so not as much progress as I had hoped. Anyways, thanks for the help.
This was in fact not a cuda shared memory overlap issue as the question states. The posted code works, and locks are not needed. The implementation contained a logic error where one variable was being reset back to a default value.
In essence, at the end of the loop in magic_brute_force you'll see:
if (index < desired_size)
    resulting_magic_moves[index] = bitboard_universe;
This was simply removed to create a working kernel.
I posted more information in the question itself, including the actual code sample and the solution, in case anyone else is wondering about CUDA memory issues. The simplified code example is a working way to deal with cuda shared memory issues, in this case.
User contributions licensed under CC BY-SA 3.0