How to concurrently write and read CUDA array with unique incrementing values?

0

I have a shared memory array initialized as follows

#define UNDEFINED 0xffffffff
#define DEFINED   0xfffffffe

__shared__ unsigned int array[100];
__shared__ count;

// We have enough threads: blockDim.x > 100
array[threadIdx.x] = UNDEFINED;


// Initialize count
if (threadIdx.x == 0)
  count = 0;

The threads have random access to array. When a thread access array, if it is UNDEFINED, it must write a unique value, count, to that element, and then read that value. If the array element is DEFINED or already has a unique value, it must just read the unique value out. The tricky part is that array and count must both be updated by only 1 thread. Atomic functions only update 1 variable not 2. Here's the method that I finally came up with for 1 thread to update both variables while blocking the other threads until it is done.

value = atomicCAS(&array[randomIndex], UNDEFINED, DEFINED);
if (value == UNDEFINED) {
    value = atomicAdd(&count, 1);
    array[randomIndex] = value;
} 

// For case that value == DEFINED_SOURCe, wait for memory
// writes, then store value
__threadfence_block();
value = array[randomSource];

There is some tricky concurrency going on here. I'm not sure that this will work for all cases. Are there better suggestions or comments?

cuda
asked on Stack Overflow Jan 15, 2016 by roger1994

2 Answers

2

According to your description, the only time an array element will be written to is if it contains the value UNDEFINED. We can leverage this.

  1. A thread will first do an atomicCAS operation on the desired array element. The atomicCAS will be configured to check for the UNDEFINED value. If it is present, it will replace it with DEFINED. If it is not present, it will not replace it.

  2. Based on the return result from atomicCAS, the thread will know if the array element contained UNDEFINED or not. If it did, then the return result from the atomicCAS will be UNDEFINED, and the thread will then go and retrieve the desired unique value from count, and use that to modify the DEFINED value to the desired unique value.

we can do this in one line of code:

// assume idx contains the desired offset into array
if (atomicCAS(array+idx, UNDEFINED, DEFINED) == UNDEFINED) array[idx]=atomicAdd(&count, 1);

A more complete code could be like this:

value = DEFINED;
while (value == DEFINED){
  value = atomicCAS(&array[randomIndex], UNDEFINED, DEFINED);
  if (value == UNDEFINED) {
    value = atomicAdd(&count, 1);
    array[randomIndex] = value;}
  }

 // value now contains the unique value, 
 // either that was already present in array[randomIndex] 
 // or the value that was just written there
answered on Stack Overflow Jan 15, 2016 by Robert Crovella • edited Jan 15, 2016 by Robert Crovella
0

For have an array of incrementing values, use prefx-sum also called scan algorithms, based on binary tree ower threads. First over local block(shared memory in the name) ? then global over blocks, then add each summ back to each block. Also it may be efficient for each block to read not one but some values, what are equal of physically "warp size" like 16 int values for example ( i apologize, because i have done this things long time ago and don't know proper sizes and proper names for this things in CUDA). Ahh, btw,the final values, in case of equal incrementing, could be retrieved as the function from local or global thread.id, so you do not need scan at all

answered on Stack Overflow Jan 15, 2016 by J J • edited Jan 15, 2016 by J J

User contributions licensed under CC BY-SA 3.0