Out of bound address when directly reading from array


I am developing a CUDA application which has some routines for allocation and deallocation of arrays in shared memory.

In this application (that, I am sorry, I cannot make available), I have a class that encapsulate a chunk of memory as an array. This class has a count method that counts the number of elements that matches a certain value.

So, imagine something like (which is an actual part of the whole class)

template <class Type>
struct Array {
    // ...

    Type &operator[](int i) { return data_[i]; }
    Type operator[](int i) const { return data_[i]; }

    size_t count(const Type &val) const {
        size_t c = 0;
        for (size_t i = 0; i < len_; ++i)
            if (data_[i] == val)
        return c;

    void print(const char *fmt, const char *sep, const char *end) const {
        for (size_t i = 0; i < len_ - 1; ++i) {
            printf(fmt, data_[i]);
        printf(fmt, _data[len_ - 1]);
    Type *data_;
    size_t len_;

Assumed that the memory I am accessing is correctly allocated (shared memory allocated at runtime, passing the dimension to the kernel), that is big enough to contain data and that data_ points to an aligned (wrt Type) region inside the shared memory. I checked this multiple times, and these assumptions shall be valid (but feel free to ask more checkings).

Now, while testing the code I found something very strange:

  • When explicitly assigning values using operator[], and reading them using operator[] const, no issues arise.
  • When reading data using print, no issues arises.
  • When calling count(), program crashes and Address ADDR is out of bounds is reported by cuda-memcheck, caused by Invalid __global__ read of size x (x = sizeof(Type)). The ADDR is inside the shared memory buffer, so it should be valid.
  • If, inside count, I replace data_[i] with (*this)[i], the program runs fine and no crash occurs.

Now, I have absolutely no idea about this could happen, and I have no idea about what to check to see what is happening behind the scenes... Why reading directly crashes? Why using operator[] does not? And why reading (directly?) inside print do not crashes?

I know this question is hard and I am sorry to provide this little information about the code... But feel free to ask for detail, I will try to answer as much as I can. Any idea or suggestion is welcome, because it is days I am trying to solve and this is as far as I could get.

I am using two different GPUs to test this code, one with capability 2.1 and one with 3.5 (the latter one is giving me detailed information about this crash, while the first one does not). CUDA 5.0

EDIT: I have located a minimal example where this error happens. Curiously, the error appears when compiling with sm_20 and sm_35, but not on sm_30. The GPU I am using has cap 3.5

/* Compile and run with:
  nvcc -g -G bug.cu -o bug -arch=sm_20 # bug!
  nvcc -g -G bug.cu -o bug -arch=sm_30 # no bug :|
  nvcc -g -G bug.cu -o bug -arch=sm_35 # bug!
  cuda-memcheck bug

Here's the output (skipping the initial rows) I get
Ctor for 0x3fffc10 w/o alloc, data 0x10000c8
Calling NON CONST []
Calling NON CONST []
Fill with [] ok
Fill with raw ok
Kernel launch failed with error:
        unspecified launch failure
========= Invalid __global__ write of size 8
=========     at 0x00000188 in /home/bio/are/AlgoCUDA/bug.cu:26:array<double>::fill(double const &)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x010000c8 is out of bounds
=========     Device Frame:/home/bio/are/AlgoCUDA/bug.cu:49:kernel_bug(unsigned long) (kernel_bug(unsigned long) : 0x8c0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x3dc) [0xc9edc]
=========     Host Frame:/opt/cuda-5.0/lib64/libcudart.so.5.0 [0x13324]
=========     Host Frame:/opt/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x182) [0x3ac62]
=========     Host Frame:bug [0xbb8]
=========     Host Frame:bug [0xaa7]
=========     Host Frame:bug [0xac4]
=========     Host Frame:bug [0xa07]
=========     Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
=========     Host Frame:bug [0x8c9]
========= Program hit error 4 on CUDA API call to cudaDeviceSynchronize 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26a180]
=========     Host Frame:/opt/cuda-5.0/lib64/libcudart.so.5.0 (cudaDeviceSynchronize + 0x1dd) [0x441fd]
=========     Host Frame:bug [0xa0c]
=========     Host Frame:/lib/libc.so.6 (__libc_start_main + 0xfd) [0x1ec4d]
=========     Host Frame:bug [0x8c9]
========= ERROR SUMMARY: 2 errors

(cuda-gdb) set cuda memcheck on
(cuda-gdb) run
Starting program: /home/bio/are/AlgoCUDA/bug 
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5c25700 (LWP 23793)]
[Context Create of context 0x625870 on Device 0]
[Launch of CUDA Kernel 0 (kernel_bug<<<(1,1,1),(1,1,1)>>>) on Device 0]
Memcheck detected an illegal access to address (@global)0x10000c8

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 12, warp 0, lane 0]
0x0000000000881928 in array<double>::fill (this=0x3fffc10, v=0x3fffc08) at bug.cu:26
26                              data[i] = v;

#include <stdio.h>

extern __shared__ char totalSharedMemory[];

template <class Type>
struct array {
    // Create an array using a specific buffer
    __device__ __host__ array(size_t len, Type *buffer):
        data(buffer) {
        printf("Ctor for %p w/o alloc, data %p\n", this, data);
    __device__ __host__ Type operator[](int i) const {
        printf("Calling CONST []\n");
        return data[i];
    __device__ __host__ Type &operator[](int i) {
        printf("Calling NON CONST []\n");
        return data[i];
    __device__ __host__ void fill(const Type &v) {
        for (size_t i = 0; i < len; ++i) data[i] = v;
    size_t len;
    Type *data;

__global__ void kernel_bug(size_t bytesPerBlock) {
    // This is a test writing to show that filling the memory
    // does not produce any error
    for (size_t i = 0; i < bytesPerBlock; ++i) {
        totalSharedMemory[i] = i % ('z' - 'a' + 1) + 'a';
        printf("[%p] %c\n", totalSharedMemory + i, totalSharedMemory[i]);

    // 200 / 8 = 25 so should be aligned
    array<double> X(2, (double *)(totalSharedMemory + 200));
    X[0] = 2;
    X[1] = 4;
    printf("Fill with [] ok\n");
    X.data[0] = 1;
    X.data[1] = 0;
    printf("Fill with raw ok\n");
    X.fill(0); // Crash here
    printf("Fill with method ok\n");

int main(int argc, char **argv) {
    // Total memory required
    size_t bytesPerBlock = 686; // Big enough for 85 doubles
    kernel_bug<<<1, 1, bytesPerBlock>>>(bytesPerBlock);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed with error:\n\t%s\n", cudaGetErrorString(err));
        return 1;
    return 0;

EDIT: tested also with CUDA 4.2, the problem persists.

asked on Stack Overflow Apr 8, 2013 by AkiRoss • edited Apr 9, 2013 by AkiRoss

2 Answers


I was able to reproduce your issue with the following:

RHEL 5.5, driver 304.54, CUDA 5.0, Quadro 5000 GPU.

I was not able to reproduce the issue with the following:

RHEL 5.5, driver 319.72, CUDA 5.5, Quadro 5000 GPU.

Please update your CUDA install to CUDA 5.5, and your driver to 319.72 or newer.

answered on Stack Overflow Feb 9, 2014 by Robert Crovella

While you are trying to pinpoint the crash, it'd be good to remove implicit conversion from 0 to 0.0 in the X.fill(0);call. It's valid C++, but CUDA could have troubles allocating temporaries in the function call operator. Indeed, skimming their docs I couldn't find an answer on where such temporaries are to be allocated -- global? device? Likely it's not the problem, though, but... to be sure.

answered on Stack Overflow Apr 9, 2013 by Mike Tyukanov

User contributions licensed under CC BY-SA 3.0