CUDA illegal memory access was encountered


I wrote a CUDA program on my laptop, which has an Nvidia GTX 960M, and the code works without any problems. I also implemented the error check described in this thread: What is the canonical way to check for errors using the CUDA runtime API?

I tested the code with cuda-memcheck as well, which reported 0 errors.
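For reference, the check from that thread is essentially the following wrapper (reproduced here as a sketch, not copied from my source file):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Error-checking wrapper in the style of the linked answer.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

Every API call and kernel launch is wrapped with it, e.g. gpuErrchk(cudaPeekAtLastError()); and gpuErrchk(cudaDeviceSynchronize());.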

I now want to run my code on a server which has an Nvidia Titan X. However, cudaPeekAtLastError() returns the error:

illegal memory access was encountered

For both my laptop and the server I use the following heap allocation:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024 * 1024);

and run with the following block and grid configuration:

int blockSize = 128;
int numBlocks = (nPossibilities + blockSize - 1) / blockSize;
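
Concretely, the launch looks roughly like this (the argument list is a placeholder, not my real parameter list; the kernel name fun is the one that appears in the cuda-memcheck output below):

int blockSize = 128;
int numBlocks = (nPossibilities + blockSize - 1) / blockSize;  // round up so every possibility gets a thread

fun<<<numBlocks, blockSize>>>(d_data, nPossibilities);  // d_data stands in for my device pointers
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());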

The GTX 960M has compute capability 5.0, while the Titan X has 6.1, but according to the compute capability table both have a maximum of 32 active blocks and a maximum of 2048 threads per multiprocessor:

https://en.wikipedia.org/wiki/CUDA

I ran cuda-memcheck on the server, and the illegal memory access turns out to be caused by a null pointer.

To solve the problem, I increased the heap memory allocation from 1 GB to 2 GB using the following lines, and the error went away:

const size_t malloc_limit = size_t(2048) * size_t(2048) * size_t(2048);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, malloc_limit); 
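
For completeness, a minimal host-side sketch of setting the limit, checking the return code, and reading the granted value back (illustration only, not my actual code):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t malloc_limit = size_t(2048) * size_t(2048) * size_t(2048);

    // The heap limit must be set before launching any kernel that allocates
    // from the device heap; check the return code to be sure it was accepted.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, malloc_limit);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));

    // Read back the limit the runtime actually granted.
    size_t granted = 0;
    cudaDeviceGetLimit(&granted, cudaLimitMallocHeapSize);
    printf("device heap limit: %zu bytes\n", granted);
    return 0;
}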

My question is: why does this problem arise on the Titan X but not on the 960M? Why do I need to increase the heap memory size for the Titan X but not for the 960M?

I can post my code if requested, but it is a large program with several function calls inside the kernel.

The cuda-memcheck output is below:

GPUassert: unspecified launch failure all.cu 779
========= CUDA-MEMCHECK
========= Invalid global write of size 8
=========     at 0x00001130 in /home/osa/cuda/all.cu:186:fun(double*, double*, double*, double*, double*, double*, int, int, int)
=========     by thread (125,0,0) in block (193,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2fe) [0x282a4e]
=========     Host Frame:./all [0x1dac1]
=========     Host Frame:./all [0x382d3]
=========     Host Frame:./all [0x9508]
=========     Host Frame:./all [0x93c0]
=========     Host Frame:./all [0x942d]
=========     Host Frame:./all [0x8d7a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Invalid global write of size 8
=========     at 0x00001130 in /home/osa/cuda/all.cu:186:fun(double*, double*, double*, double*, double*, double*, int, int, int)
=========     by thread (124,0,0) in block (193,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2fe) [0x282a4e]
=========     Host Frame:./all [0x1dac1]
=========     Host Frame:./all [0x382d3]
=========     Host Frame:./all [0x9508]
=========     Host Frame:./all [0x93c0]
=========     Host Frame:./all [0x942d]
=========     Host Frame:./all [0x8d7a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x391b13]
=========     Host Frame:./all [0x3c2c6]
=========     Host Frame:./all [0x8d83]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaPeekAtLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x391b13]
=========     Host Frame:./all [0x39b93]
=========     Host Frame:./all [0x8d88]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= ERROR SUMMARY: 4 errors

In my code, the total number of possibilities for a number of at most 19 digits is calculated. This number determines the total number of threads. The possibilities are given by (2^n) - 1, so if I choose 9 digits there are 511 possibilities and the process executes 511 threads in total.

Although I choose a block size of 128 for the kernel configuration, I also pass the number of possibilities (nPossibilities) as a parameter, and inside the kernel I do the following:

if (idx > 0 && idx < nPossibilities)
{
    // Do something
}
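
To give a rough idea of the structure (a simplified sketch, not my real kernel; the argument list and allocation size are placeholders):

__global__ void fun(double *out, int nPossibilities, int nDigits)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx > 0 && idx < nPossibilities)
    {
        // In-kernel new allocates from the device heap sized by cudaDeviceSetLimit.
        double *scratch = new double[nDigits];
        if (scratch == nullptr)
        {
            // If the device heap is exhausted, new returns nullptr; writing through
            // it would produce exactly the out-of-bounds access at address 0x0
            // that cuda-memcheck reported.
            return;
        }

        // ... fill scratch and compute a result ...
        out[idx] = 0.0;  // placeholder result

        delete[] scratch;
    }
}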

On the server the code works up to 15 digits, which corresponds to 32,767 possibilities; 16 digits and above results in the error posted above. For 16 digits it is 65,535. Does that mean that on the Titan Xp ~32,000 threads in flight require 1 GB of heap, and above that I need to allocate more? But for 19 digits I will need 524,287 threads in total, which is a lot! So how is 1 GB enough for ~32,000 threads, while 2 GB is enough for ~524,000 threads?

The size of the variables I allocate with new inside the kernel also depends on the number of digits. I roughly calculated the size of the allocated variables: for 15 digits it is 0.032 MB, for 16 digits 0.034 MB, and for 19 digits 0.0415 MB.

Tags: c++, cuda
asked on Stack Overflow Sep 8, 2020 by ValSe • edited Sep 8, 2020 by ValSe

1 Answer


Because the Titan Xp supports more threads "in flight" than a 960M.

Presumably in your CUDA device code, you are doing something like malloc or new (and hopefully also free or delete). These allocate out of the device heap, the size of which is controlled by the CUDA runtime API call you are using for this: cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024 * 1024);.

The maximum number of threads that can be running at a given time on either of these GPUs is 2048 times the number of SMs. Even if the occupancy for your particular code is less than 2048 threads per SM, that maximum-occupancy figure (threads per SM) is probably the same whether you are on the 960M or the Titan Xp.

Therefore the total number of threads in flight is determined by the number of SMs. The 960M has 5 SMs, so it can have at most 2048x5 = ~10,000 threads in flight (i.e. in some stage of execution). The Titan Xp has 30 SMs, so it can have 2048x30 = ~60,000 threads in flight. That means that if each thread did a malloc of a particular size, and then later a free, you could have 10,000 of those allocations outstanding at any point in time on a 960M, but 60,000 of those outstanding at any point in time on a Titan Xp. More allocations outstanding = more demand on (device heap) memory.
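
As a rough illustration, this "threads in flight" capacity can be read straight from the device properties (a minimal host-side sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // e.g. 2048 * 5  = 10,240 resident threads on a GTX 960M
    //      2048 * 30 = 61,440 resident threads on a Titan Xp
    int threadsInFlight = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
    printf("%s: %d SMs, up to %d threads in flight\n",
           prop.name, prop.multiProcessorCount, threadsInFlight);
    return 0;
}

With roughly six times as many threads able to hold an in-kernel allocation at once, the Titan Xp puts correspondingly more pressure on the device heap for the same per-thread allocation size.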

So it's quite possible that you will need more available space for the device heap, on Titan Xp vs. 960M.

answered on Stack Overflow Sep 8, 2020 by Robert Crovella • edited Sep 8, 2020 by Robert Crovella

User contributions licensed under CC BY-SA 3.0