How does __shfl_sync()
(and similar) functions behave when exceeding maximum threadIdx
? Or attempting to read threads that have exited? In the documentation I can read two conflicting sentences that leave me puzzled. The following statements are taken from the CUDA C Programming Guide
Statement 1:
If the source lane ID is out of range or the source thread has exited, the calling thread's own var is returned.
Statement 2:
Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.
But the thread that has existed, must have diverged and is not actively participating in the __shfl_sync()
.
For example, consider:
__global__ void kernel() {
int val = threadIdx.x;
if (threadIdx.x>5)
return;
__syncthreads();
val = __shfl_down_sync(0x7fffffff, val, +5);
printf("thread[%d].val = %d\n", threadIdx.x, val);
}
.....
kernel<<<1,32>>>()
Is the resulting val
unchanged or undefined?
When I actually tested, I got the output:
thread[0].val = 5
thread[1].val = 0
thread[2].val = 0
thread[3].val = 0
thread[4].val = 0
thread[5].val = 0
which would suggest that the value is in fact undefined. But then what it means that a thread has exited?
I get the same result when I remove the conditional return, but invoke the kernel with just 6 threads. So ID that are too high still give the undefined value?
So is the Statement 1 completely wrong? Or I misunderstand it somehow?
User contributions licensed under CC BY-SA 3.0