Error with 'cuda-memcheck' in cuda 8.0

Question

Strangely, when I run ./main directly, the program finishes without any warning or error message. However, when I run it under cuda-memcheck (cuda-memcheck ./main), I get error messages like the following:

========= Invalid __global__ write of size 8
=========     at 0x00000120 in initCurand(curandStateXORWOW*, unsigned long)
=========     by thread (9,0,0) in block (3,0,0)
=========     Address 0x5005413b0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x204115]
=========     Host Frame:./main [0x18e11]
=========     Host Frame:./main [0x369b3]
=========     Host Frame:./main [0x3403]
=========     Host Frame:./main [0x308c]
=========     Host Frame:./main [0x30b7]
=========     Host Frame:./main [0x2ebb]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]

Here is my code. Briefly, I generate random numbers and save them to a device variable, weights, then use this array to sample from a set of discrete values.

#include<iostream>
#include<curand.h>
#include<curand_kernel.h>
#include<time.h>

using namespace std;

#define num 100


__device__ float weights[num];

// function to define seed
__global__ void initCurand(curandState *state, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &state[idx]);
}


__device__ void sampling(float *weight, float max_weight, int *index, curandState *state){
    int j;
    float u;    
    do{
        j = (int)(curand_uniform(state) * (num + 0.999999)); 
        u = curand_uniform(state); //sample from uniform distribution;
    }while( u > weight[j]/max_weight);
    *index  = j;
}

__global__ void test(int *dev_sample, curandState *state){
    int idx     = threadIdx.x + blockIdx.x * blockDim.x;
    // generate random numbers from uniform distribution and save them to weights
    weights[idx]    = curand_uniform(&state[idx]);
    // run sampling function, in which, weights is an input for the function on each thread
    sampling(weights, 1, dev_sample+idx, &state[idx]);
}


int main(){ 
    // define the seed of random generator
    curandState *devState;  
    cudaMalloc((void**)&devState, num*sizeof(curandState));

    int *h_sample;
    h_sample    = (int*) malloc(num*sizeof(int));

    int *d_sample;
    cudaMalloc((void**)&d_sample, num*sizeof(float));

    initCurand<<<(int)num/32 + 1, 32>>>(devState, 1);
    test<<<(int)num/32 + 1, 32>>>(d_sample, devState);

    cudaMemcpy(h_sample, d_sample, num*sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < num; ++i)
    {
        cout << *(h_sample + i) << endl;
    }

    //free memory
    cudaFree(devState);
    free(h_sample);
    cudaFree(d_sample);
    return 0;
}

I have just started learning CUDA. If the way I access global memory is incorrect, please help me fix it. Thanks.

cuda

asked on Stack Overflow Nov 10, 2016 by

Fly_back

1 Answer

This is launching "extra" threads:

initCurand<<<(int)num/32 + 1, 32>>>(devState, 1);

num is 100, so (int)num/32 + 1 evaluates to 4, and the above config will launch 4 blocks of 32 threads each, i.e. 128 threads. But you are only allocating space for 100 curandState objects here:

cudaMalloc((void**)&devState, num*sizeof(curandState));

So your initCurand kernel will have some threads (idx = 100 to 127) that attempt to initialize curandState entries you haven't allocated. The report is consistent with this: thread (9,0,0) in block (3,0,0) has idx = 3*32 + 9 = 105. As a result, when you run under cuda-memcheck, which does fairly rigorous out-of-bounds checking, an error is reported.
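The thread count described above can be checked on the host. The sketch below is illustrative only: blocksAsPosted mirrors the question's (int)num/32 + 1 expression, while blocksFor and extraThreads are hypothetical helper names that do not appear in the original post. blocksFor uses ceiling division, a common alternative that avoids launching an extra block when the element count is an exact multiple of the block size.

```cuda
// Host-only sketch (no kernels); helper names are assumptions for illustration.

// Grid size as written in the question: always adds one block.
constexpr int blocksAsPosted(int n, int threadsPerBlock) {
    return n / threadsPerBlock + 1;
}

// Ceiling division: the smallest number of blocks that covers n elements.
constexpr int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of launched threads whose index is >= n.
constexpr int extraThreads(int blocks, int threadsPerBlock, int n) {
    return blocks * threadsPerBlock - n;
}
```

For num = 100 and 32 threads per block, both expressions give 4 blocks, so 28 of the 128 launched threads have no matching element. Ceiling division only changes the result for exact multiples (for 96 elements it gives 3 blocks instead of 4), so a bounds check inside the kernel is still needed either way.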

One possible solution would be to modify your initCurand kernel as follows:

__global__ void initCurand(curandState *state, unsigned long seed, int n){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // thread check: skip threads beyond the n allocated states
    if (idx < n)
        curand_init(seed, idx, 0, &state[idx]);
}

This will prevent any out-of-bounds threads from doing anything. The parameter is named n rather than num because num is a #define macro in your code, so a parameter called num would be replaced by 100 and fail to compile. You will need to modify the kernel call to pass num as the third argument. It also appears to me that you have a similar problem in your test kernel, and you may want to do something similar to fix it there. This is a common construct in CUDA kernels; I call it a "thread check". You can find other questions under the cuda tag here on SO discussing this same concept.
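As a rough sketch of how the same thread check might be applied beyond initCurand, the code below guards both a state-initialization kernel and a kernel that writes one uniform value per element, then launches them from a small host wrapper. The names initCurandGuarded, fillUniformGuarded and runGuarded are hypothetical, and the output buffer stands in for the question's weights array; this is not the poster's code, only one way the pattern could look.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Thread check: threads with idx >= n do nothing.
__global__ void initCurandGuarded(curandState *state, unsigned long long seed, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        curand_init(seed, idx, 0, &state[idx]);
}

// Same guard in a kernel that writes one uniform sample per element.
// `out` is an assumed stand-in for the question's `weights` array.
__global__ void fillUniformGuarded(float *out, curandState *state, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        out[idx] = curand_uniform(&state[idx]);
}

// Hypothetical host wrapper: fills h_out[0..n-1]; returns 0 on success, 1 on error.
int runGuarded(float *h_out, int n) {
    const int threads = 32;
    const int blocks = (n + threads - 1) / threads;  // may still exceed n threads
    curandState *d_state = nullptr;
    float *d_out = nullptr;

    if (cudaMalloc((void**)&d_state, n * sizeof(curandState)) != cudaSuccess)
        return 1;
    if (cudaMalloc((void**)&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_state);
        return 1;
    }

    initCurandGuarded<<<blocks, threads>>>(d_state, 1ULL, n);
    fillUniformGuarded<<<blocks, threads>>>(d_out, d_state, n);

    // Check for launch errors, then copy back (cudaMemcpy waits for the kernels).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    cudaFree(d_state);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling runGuarded(h, 100) with a host array of 100 floats should leave each entry in (0, 1], since curand_uniform excludes 0 and includes 1. That same range matters for the question's sampling function: (int)(curand_uniform(state) * (num + 0.999999)) can evaluate to num itself, so weight[j] would likely need its own bounds handling in addition to the thread check.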

answered on Stack Overflow Nov 10, 2016 by

Robert Crovella

User contributions licensed under CC BY-SA 3.0