Using CUDA atomicInc to get unique indices

I have CUDA kernel where basically each thread holds a value, and it needs to add that value to one or more lists in shared memory. So for each of those lists, it needs to get an index value (unique for that list) to put the value.

The real code is different, but there are lists like:

typedef struct {
    unsigned int numItems;
    float items[MAX_NUM_ITEMS];
} List;
__shared__ List lists[NUM_LISTS];

The values numItems are initially all set to 0, and then a __syncthreads() is done.

To add its value to the lists, each thread does:

for(int list = 0; list < NUM_LISTS; ++list) {
    if(should_add_to_list(threadIdx, list)) {
        unsigned int index = atomicInc(&lists[list].numItems, 0xffffffff);
        assert(index < MAX_NUM_ITEMS); // always true
        lists[list].items[index] = my_value;
    }
}

This works most of the time, but it seems that making unrelated changes in other parts of the kernel (such as removing asserts that always succeed) sometimes causes two threads to get the same index for one list, or causes indices to be skipped. The final value of numItems always ends up correct, however.

However, when the following custom implementation, atomicInc_, is used instead, it seems to work correctly:

__device__ static inline uint32_t atomicInc_(uint32_t* ptr) {
    uint32_t value;
    do {
        value = *ptr;
    } while(atomicCAS(ptr, value, value + 1) != value);
    return value;
}

Are the two atomicInc functions equivalent, and is it valid to use atomicInc that way to get unique indices?

According to the CUDA programming guide, the atomic functions do not imply memory ordering constraints, and different threads can access the numItems counters of different lists at the same time: could this cause it to fail?

Edit:

The real kernel looks like this:

Basically there is a list of spot blocks, containing spots. Each spot has XY coordinates (col, row). The kernel needs to find, for each spot, the spots that are in a certain window (col/row difference) around it, and put them into a list in shared memory.

The kernel is called with a fixed number of warps. A CUDA block corresponds to a group of spot blocks (here 3), called the local spot blocks.

First it takes the spots from the block's 3 spot blocks and copies them into shared memory (localSpots[]). For this it uses one warp per spot block, so that the spots can be read in a coalesced way. Each thread in the warp handles one spot of its local spot block. The spot block indices are hardcoded here (blocks[]).

Then it goes through the surrounding spot blocks: these are all the spot blocks that may contain spots close enough to a spot in the local spot blocks. The surrounding spot block indices are also hardcoded here (sblocks[]).

In this example it uses only the first warp for this, and traverses sblocks[] iteratively. Each thread in the warp handles one spot of the surrounding spot block. It also iterates through the list of all the local spots. If the thread's spot is close enough to a local spot, it inserts it into that local spot's list, using atomicInc to get an index.

When executed, the printf shows that for a given local spot (here the one with row=37, col=977), indices are sometimes repeated or skipped.

The real code is more complex/optimized, but this code already has the problem. Here it also only runs one CUDA block.

#include <assert.h>
#include <stdio.h>

#define MAX_NUM_SPOTS_IN_WINDOW 80

__global__ void Kernel(
    const uint16_t* blockNumSpotsBuffer,
    XGPU_SpotProcessingBlockSpotDataBuffers blockSpotsBuffers,
    size_t blockSpotsBuffersElementPitch,
    int2 unused1,
    int2 unused2,
    int unused3 ) {
    typedef unsigned int uint;

    if(blockIdx.x!=30 || blockIdx.y!=1) return;

    int window = 5;

    ASSERT(blockDim.x % WARP_SIZE == 0);
    ASSERT(blockDim.y == 1);

    uint numWarps = blockDim.x / WARP_SIZE;
    uint idxWarp = threadIdx.x / WARP_SIZE;
    int idxThreadInWarp = threadIdx.x % WARP_SIZE;

    struct Spot {
        int16_t row;
        int16_t col;
        volatile unsigned int numSamples;
        float signalSamples[MAX_NUM_SPOTS_IN_WINDOW];
    };

    __shared__ uint numLocalSpots;
    __shared__ Spot localSpots[3 * 32];

    numLocalSpots = 0;

    __syncthreads();

    ASSERT(numWarps >= 3);
    int blocks[3] = {174, 222, 270};
    if(idxWarp < 3) {
        uint spotBlockIdx = blocks[idxWarp];
        ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);

        uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
        ASSERT(numSpots < WARP_SIZE);

        size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;

        uint outOffset;
        if(idxThreadInWarp == 0) outOffset = atomicAdd(&numLocalSpots, numSpots);
        outOffset = __shfl_sync(0xffffffff, outOffset, 0, 32);

        if(idxThreadInWarp < numSpots) {
            Spot* outSpot = &localSpots[outOffset + idxThreadInWarp];
            outSpot->numSamples = 0;

            uint32_t coord = blockSpotsBuffers.coord[inOffset];
            UnpackCoordinates(coord, &outSpot->row, &outSpot->col);
        }
    }



    __syncthreads();


    int sblocks[] = { 29,30,31,77,78,79,125,126,127,173,174,175,221,222,223,269,270,271,317,318,319,365,366,367,413,414,415 };
    if(idxWarp == 0) for(int block = 0; block < sizeof(sblocks)/sizeof(int); ++block) {
        uint spotBlockIdx = sblocks[block];
        ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);

        uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
        uint idxThreadInWarp = threadIdx.x % WARP_SIZE;
        if(idxThreadInWarp >= numSpots) continue;

        size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;

        uint32_t coord = blockSpotsBuffers.coord[inOffset];
        if(coord == 0) return; // invalid surrounding spot

        int16_t row, col;
        UnpackCoordinates(coord, &row, &col);

        for(int idxLocalSpot = 0; idxLocalSpot < numLocalSpots; ++idxLocalSpot) {
            Spot* localSpot = &localSpots[idxLocalSpot];

            if(localSpot->row == 0 && localSpot->col == 0) continue;
            if((abs(localSpot->row - row) >= window) && (abs(localSpot->col - col) >= window)) continue;

            int index = atomicInc_block((unsigned int*)&localSpot->numSamples, 0xffffffff);
            if(localSpot->row == 37 && localSpot->col == 977) printf("%02d  ", index); // <-- sometimes indices are skipped or duplicated

            if(index >= MAX_NUM_SPOTS_IN_WINDOW) continue; // index out of bounds, discard value for median calculation
            localSpot->signalSamples[index] = blockSpotsBuffers.signal[inOffset];
        }
    }
}

Output looks like this:

00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  23                                                                                                                   
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24                                                                                                                 
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  02  03  03  04  05  06  07  08  09  10  11  12  06  13  14  15  16  17  18  19  20  21        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  23        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24    

Each line is the output of one execution (the kernel is run multiple times). It is expected that indices appear in different orders. But for example on the third-last line, index 23 is repeated.

Using the atomicCAS-based implementation seems to fix it. Adding a __syncwarp() between iterations of the outer for-loop also seems to fix it. But it is not clear why, or whether that always fixes it.


Edit 2: This is a full program (main.cu) that shows the problem:

https://pastebin.com/cDqYmjGb

The CMakeLists.txt:

https://pastebin.com/iB9mbUJw

Must be compiled with -DCMAKE_BUILD_TYPE=Release.

It produces this output:

00(0:00000221E40003E0)
01(2:00000221E40003E0)
02(7:00000221E40003E0)
03(1:00000221E40003E0)
03(2:00000221E40003E0)
04(3:00000221E40003E0)
04(1:00000221E40003E0)
05(4:00000221E40003E0)
06(6:00000221E40003E0)
07(2:00000221E40003E0)
08(3:00000221E40003E0)
09(6:00000221E40003E0)
10(3:00000221E40003E0)
11(5:00000221E40003E0)
12(0:00000221E40003E0)
13(1:00000221E40003E0)
14(3:00000221E40003E0)
15(1:00000221E40003E0)
16(0:00000221E40003E0)
17(3:00000221E40003E0)
18(0:00000221E40003E0)
19(2:00000221E40003E0)
20(4:00000221E40003E0)
21(4:00000221E40003E0)
22(1:00000221E40003E0)

For example, the lines starting with 03 show that two threads (1 and 2) got the same result (3) after calling atomicInc_block on the same counter (at 0x00000221E40003E0).

cuda
atomic
memory-barriers
asked on Stack Overflow May 17, 2021 by tmlen • edited May 18, 2021 by tmlen

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0