CUDA memcheck address - how to determine location in code?


cuda-memcheck is reporting this information for a release mode CUDA kernel:

========= Error: process didn't terminate successfully
========= Invalid __global__ read of size 4
=========     at 0x000002c8 in xx_kernel
=========     by thread (0,0,0) in block (0,0)
=========     Address 0x10101600014 is out of bounds
========= ERROR SUMMARY: 1 error

This fault only happens in release mode. It also doesn't happen when running under cuda-gdb.

How can I take the 0x000002c8 address and determine the code that is causing the fault? I've looked through the cached intermediate files (.ptx, .cubin, etc) and see no obvious way to determine the faulty source code.

This is on x86_64 Linux with CUDA 3.2.

UPDATE: Turns out it was a compiler bug in 3.2. Upgrading to 4.0 makes the memcheck error go away. Also, I was able to disassemble the CUBIN with the cuobjdump from 4.0, but since it was release mode and optimized, it was exceedingly difficult to match the disassembly to the source code.

asked on Stack Overflow Jun 23, 2011 by dwelch91 • edited Jun 25, 2011 by dwelch91

2 Answers


Download the CUDA Toolkit 4.0 from the NVIDIA Developer Zone. Use the new cuobjdump that supports 2.x cubins.

cuobjdump -sass /path/to/your/cubin > /path/to/dump.txt.

Example output (tested on a sm_20 cubin, code version 2.3)

/*6018*/     /*0xe00100075003ff9a*/     CAL 0x46d8;
/*6020*/     /*0x10001de428000000*/     MOV R0, R4;
/*6028*/     /*0x00001de428000000*/     MOV R0, R0;
/*6030*/     /*0x40011de428000000*/     MOV R4, R16;
answered on Stack Overflow Jun 24, 2011 by user703016

This kind of errors within a kernel is tied to a memory access which is not only based upon the thread identifier.

Considering that every memory area you use has been correctly allocated for the GPU, access based only upon something like threadIdx.x shouldn't cause any problem. Thus:

  • either you have a wrong index calculation (it is frequent with expressions like data[blockDim.y * blockDim.x * threadIdx.z + blockDim.x * threadIdx.y + threadIdx.x] for instance)
  • or you use another variable in your index calculation which make it exceed your array bounds (for example data[threadIdx.x + offset])

---- edit (following comments) ----
See @Cicada's answer for the complement on cuobjdump for device > 2.x

answered on Stack Overflow Jun 23, 2011 by jopasserat • edited Jun 24, 2011 by jopasserat

User contributions licensed under CC BY-SA 3.0