How to get an error message when exceeding thread memory capacity in CUDA?


I'm new to CUDA programming. I'm building some classes to provide an interface for cuRAND. My test program shows some strange behaviour, though.

The goal is to generate N random numbers on the GPU. This is the main routine:

#include <cmath>
#include <fstream>
#include <vector>

using namespace std;

// Output file stream used in main()
ofstream output;

int main(){

    // Number of results
    int N = pow(10, 7);

    output.open("gpu.dat");

    int threads_per_block = 32;
    int blocks = 200;

    // Number of results each thread must compute
    int res_each_thread = N / threads_per_block / blocks + 1;
    // Reset N so that every thread computes the same number of results
    N = threads_per_block * blocks * res_each_thread;

    // User-defined class that provides an interface for curand
    MTGP32 randomengine(threads_per_block, blocks);

    // Vector of results, on host and on device
    vector<double> results(N);
    double * d_results;
    cudaMalloc((void **)&d_results, N * sizeof(double));

    // This is the pointer to the first position of array of the generator states
    auto mtgp_states = randomengine.getStatesPtr();

    kernel<<<blocks,threads_per_block>>>(mtgp_states, res_each_thread, d_results);

    cudaDeviceSynchronize();

    cudaMemcpy(results.data(), d_results, N * sizeof(double), cudaMemcpyDeviceToHost);

    for(auto & i : results){
        output << i << "\n";
    }

    output.close();

    return 0;
}

And the kernel routine:

__global__ void kernel(MTGP32_state * states, int res_each_thread, double * results){
    int th_id = threadIdx.x + blockIdx.x * blockDim.x;

    __shared__ RandomDistrGPU<thrust::normal_distribution<>> * distr;
    if( threadIdx.x == 0 ){
        distr = new RandomDistrGPU<thrust::normal_distribution<>>(states + blockIdx.x, 0., 1.);
    }
    __syncthreads();

    auto * v = new double[res_each_thread];
    for(int i = 0; i < res_each_thread; ++i){
        v[i] = distr->generate();
        results[i * (blockDim.x * gridDim.x) + th_id] = v[i];
    }
    __syncthreads();

    if( threadIdx.x == 0 ){
        delete distr;
    }
    delete[] v;
}

This program worked fine with N=10^6, but with N=10^7 it produced only "0" as results, with no error message. Running it through the cuda-memcheck tool reported an allocation failure inside the kernel, due to "Address 0x00000000 is out of bounds". I supposed the cause of this issue was the block memory capacity, which was not big enough to contain all the res_each_thread doubles, but I'm not sure. My GPU has compute capability 5.0, so it should have 64 KB of memory per block. Trying to "manually count" the occupied space, it seems that the required memory exceeds this limit with N=10^7.
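(For context, one possible explanation of the cuda-memcheck report, stated here as an assumption rather than a confirmed diagnosis: in-kernel `new`/`malloc` allocates from a fixed-size device heap, 8 MB by default, and on exhaustion it returns a null pointer instead of aborting, which would produce exactly an "Address 0x00000000 is out of bounds" access. A minimal sketch of querying and raising that limit from the host, before the first kernel launch:)

```cuda
#include <cstdio>

int main(){
    // Query the current device-heap limit used by in-kernel new/malloc
    size_t heap_size = 0;
    cudaDeviceGetLimit(&heap_size, cudaLimitMallocHeapSize);
    printf("current device heap: %zu bytes\n", heap_size);

    // Raise it (here to 256 MB); this must happen before any kernel
    // that allocates on the device is launched.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                                         (size_t)256 * 1024 * 1024);
    if(err != cudaSuccess){
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```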

Question 1: is my guess right? I'm not sure I understood how the memory management works. How much memory is available to each thread? Is it the memory per block, minus the memory allocated for __shared__ objects, divided by the number of threads?

Question 2: is there a way to compute at runtime the memory that a certain kernel needs to allocate (and possibly print a message)? Or, even better, is there a function that warns you that you are trying to allocate more space than you can afford?
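(A rough sketch of such a runtime check, under the assumption that the device heap is the relevant limit; the per-block overhead constant is a guess, since the real footprint of RandomDistrGPU and the device allocator's bookkeeping are not known here:)

```cuda
#include <cstdio>

// Hypothetical pre-launch check: estimate what the kernel will new on the
// device and compare against the device-heap limit.
bool kernel_fits(int blocks, int threads_per_block, int res_each_thread){
    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);

    // Every thread news res_each_thread doubles; every block news one
    // distribution object (64 bytes is an assumed upper bound for it,
    // plus allocator overhead).
    size_t needed = (size_t)blocks * threads_per_block * res_each_thread * sizeof(double)
                  + (size_t)blocks * 64;

    if(needed > heap){
        fprintf(stderr,
                "kernel needs ~%zu bytes of device heap, only %zu available\n",
                needed, heap);
        return false;
    }
    return true;
}
```

(Such an estimate can only be approximate, because the device allocator adds per-allocation bookkeeping of its own.)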

Question 3: how can I stop the execution of the program and get an error message saying that something went wrong? I ask because the allocation failure inside the kernel stops its execution, while the flow in the main routine goes on as if nothing had happened.
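(One conventional pattern for surfacing such failures on the host, shown as a sketch: check the launch with cudaGetLastError and the kernel's execution with the return value of cudaDeviceSynchronize. On the device side, a null check after `new` followed by asm("trap;") aborts the kernel so the host call then returns an error:)

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; abort with file/line on failure
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t e_ = (call);                                   \
        if(e_ != cudaSuccess){                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(e_));                       \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while(0)

// Usage around the launch in main():
//     kernel<<<blocks, threads_per_block>>>(mtgp_states, res_each_thread, d_results);
//     CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
//     CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran
//
// And inside the kernel, after the allocation:
//     double * v = new double[res_each_thread];
//     if(v == nullptr){
//         printf("thread %d: device heap exhausted\n",
//                threadIdx.x + blockIdx.x * blockDim.x);
//         asm("trap;");   // abort the kernel; the host sees an error
//     }
```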

P.S. A "solution" to all the problems here could be to set results[i * (blockDim.x * gridDim.x) + th_id] = distr->generate(); directly and not allocate space for the temporary v array, but my goal here is to understand the logic behind it.
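(For reference, the variant described in the P.S. would look roughly like this, reusing the same user-defined MTGP32_state and RandomDistrGPU types from the question; with no per-thread `new`, the only device-heap allocation left is the one distribution object per block:)

```cuda
__global__ void kernel(MTGP32_state * states, int res_each_thread, double * results){
    int th_id = threadIdx.x + blockIdx.x * blockDim.x;

    __shared__ RandomDistrGPU<thrust::normal_distribution<>> * distr;
    if( threadIdx.x == 0 ){
        distr = new RandomDistrGPU<thrust::normal_distribution<>>(states + blockIdx.x, 0., 1.);
    }
    __syncthreads();

    // Write straight into global memory: no temporary per-thread array
    for(int i = 0; i < res_each_thread; ++i){
        results[i * (blockDim.x * gridDim.x) + th_id] = distr->generate();
    }
    __syncthreads();

    if( threadIdx.x == 0 ){
        delete distr;
    }
}
```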

c++
cuda
gpu-programming
asked on Stack Overflow May 19, 2020 by dariobaron • edited May 19, 2020 by dariobaron

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0