shuffle intrinsics with non-default mask providing data from inactive threads to active threads


I'm using CUDA 9 on a Pascal architecture, trying to implement a reasonable block reduction using warp shuffle intrinsics plus a shared memory intermediate step.

Examples I've seen on the web:

The first of those links illustrates the shuffle intrinsics with _sync, and how to use __ballot_sync(), but only goes as far as a single-warp reduction.

The second of those links is a Kepler-era article that doesn't use the newer _sync but does illustrate a full block level reduction by staging individual warp reductions into shared memory, then reading those values back into warp 0 and doing one more warp reduction to achieve a block reduction.
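For reference, the single-warp sum reduction those examples demonstrate looks roughly like this (my paraphrase, assuming all 32 lanes of the warp are active):

__device__ int warpReduceSum(int val)
{
    // every active lane votes "true", so with a full warp the mask is 0xFFFFFFFF
    unsigned mask = __ballot_sync(0xFFFFFFFFu, 1);
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);   // add the value held by the lane "offset" above
    return val;                                       // lane 0 ends up holding the warp's sum
}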

Where my problem differs from these and other examples I've seen on the web is that my reduction operator isn't a simple sum, and my "N" usually won't be a nice power of 2. From my debugging efforts, it seems that when an active thread (included in the mask provided by __ballot_sync()) tries to obtain a value from an inactive thread (not included in the mask), it retrieves a "0". A "0" would work fine for a sum reduction, but it breaks a min reduction.

Take the following code excerpt:

__device__ void warpReduceMin(uint32_t &val, uint32_t mask)
{
   for (int offset=16; offset>0; offset /= 2)
   {
       uint32_t tmp;
       tmp = __shfl_down_sync(mask, val, offset);
       val = (tmp<val) ? tmp : val;
    }
}

__global__ void my_kernel(uint32_t *d_data, uint32_t N)
{
    __shared__ uint32_t shmem[32];

    if (threadIdx.x >= N) return;

    uint32_t mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < blockDim.x);
    uint32_t val = d_data[threadIdx.x];
    uint32_t warp_id = threadIdx.x / warpSize;
    uint32_t lane_id = threadIdx.x % warpSize;

    warpReduceMin(val, mask);
    // val is erroneously set to "0" for the active threads in last warp
    if (lane_id == 0)
        shmem[warp_id] = val;
    __syncthreads();
    val = shmem[lane_id];
    mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < (blockDim.x+warpSize-1)/warpSize );
    if (warp_id == 0)
        warpReduceMin( val, mask );

    // do something with result...

}

If I call the kernel with a block size of 1024 and I have 1024 elements in my data (N=1024), I get the expected answer. But if I call the kernel with a block size of 1024 and N=1000, then I can see through printf debugging that in my last warp of incomplete data (warp_id == 31; elements 992:999), the initial offset of 16 is pulling a "0" from a thread which isn't even participating in the warp shuffle.
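For what it's worth, here is roughly the printf instrumentation I added to warpReduceMin to see this (a sketch of my debugging, not the exact code; it needs #include <stdio.h> in the .cu file):

__device__ void warpReduceMin(uint32_t &val, uint32_t mask)
{
   for (int offset=16; offset>0; offset /= 2)
   {
       uint32_t tmp;
       tmp = __shfl_down_sync(mask, val, offset);
       // debugging sketch: watch the incomplete last warp of the first block
       if (blockIdx.x == 0 && threadIdx.x >= 992)
           printf("lane %u: offset=%d mask=0x%08x tmp=%u val=%u\n",
                  threadIdx.x % 32, offset, mask, tmp, val);
       val = (tmp<val) ? tmp : val;
    }
}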

So I'm not quite sure where my error is.

cuda
asked on Stack Overflow Jul 23, 2019 by wrjohns • edited Jul 23, 2019 by talonmies

1 Answer


There are probably other things I could quibble about in this code, but the proximal issue that is causing the unexpected result is that you are doing an undefined warp shuffle here:

tmp = __shfl_down_sync(mask, val, offset);

In the case of the last warp (i.e. warp ID 31) in the "last" threadblock, i.e. the one where this line is preventing some threads from participating:

if (threadIdx.x >= N) return;

you have the following scenario (let's consider a single threadblock of 1024 threads, with N=1000): In the last warp, there are 8 active threads, whose threadIdx.x values are from 992 to 999, inclusive. The other threads in that warp would be "removed" by the if statement. So we have 8 active threads, and in the first pass of the warp shuffle, offset is 16. So lane 0 (thread 992) is requesting the value from lane 16 (thread 1008). But thread 1008 is not participating. That combination (either the source or destination lane not participating in the warp shuffle) is specifically identified in the programming guide as producing undefined results:

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.

Specifying the mask parameter to any particular value does not change this behavior/requirement. There is a careful description of the mask parameter here. In fact your mask is 0xFF (selecting 8 threads) entering into this problematic case, so it is "consistent" with the fact that you have 8 active threads, but doesn't address the warp-shuffle inactive source lane problem.
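To make that concrete with the numbers from the analysis above (the values here are illustrative, taken from the hand-analysis rather than from any particular run):

// warp ID 31: only lanes 0..7 (threads 992..999) survive the early return,
// so __ballot_sync(0xFFFFFFFF, ...) produces mask == 0x000000FF
// first trip through the loop: offset == 16
tmp = __shfl_down_sync(0x000000FF, val, 16);  // lane 0 reads from lane 16, an inactive lane
// tmp is undefined here; in practice it often comes back as 0, and
// min(0, val) then wipes out the partial minimum for a min reduction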

I think the simplest way to fix this is to make sure that each warp is fully active, and populated with an appropriate value for reduction, entering into each warp shuffle operation. If you do that, you can get rid of some other aspects of your code that I consider slightly problematic. Here's a "fixed" example:

$ cat t1456.cu
#include <stdio.h>
#include <stdint.h>
__device__ void warpReduceMin(uint32_t &val, uint32_t mask)
{
   for (int offset=16; offset>0; offset /= 2)
   {
       uint32_t tmp;
       tmp = __shfl_down_sync(mask, val, offset);
       val = (tmp<val) ? tmp : val;
    }
}

__global__ void my_kernel(uint32_t *d_data, uint32_t N)
{
    __shared__ uint32_t shmem[32];

    uint32_t mask = 0xFFFFFFFFU;
    uint32_t val = (threadIdx.x < N)?(d_data[threadIdx.x]):0xFFFFFFFFU;
    uint32_t warp_id = threadIdx.x / warpSize;
    uint32_t lane_id = threadIdx.x % warpSize;

    warpReduceMin(val, mask);
    if (lane_id == 0)
        shmem[warp_id] = val;
    __syncthreads();
    val = shmem[lane_id];
    if (warp_id == 0)
        warpReduceMin( val, mask );
    if (threadIdx.x == 0)
      printf("val = %u\n", val);
}

int main(){
  const uint32_t N = 1000;
  uint32_t *d_data, *h_data = (uint32_t *)malloc(N*sizeof(uint32_t));
  cudaMalloc(&d_data, N*sizeof(uint32_t));
  for (int i = 0; i < N; i++)
    h_data[i] = i+1;
  cudaMemcpy(d_data, h_data, N*sizeof(uint32_t), cudaMemcpyHostToDevice);
  my_kernel<<<1,1024>>>(d_data, N);
  cudaDeviceSynchronize();
}

$ nvcc -o t1456 t1456.cu
$ cuda-memcheck ./t1456
========= CUDA-MEMCHECK
val = 1
========= ERROR SUMMARY: 0 errors
$
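As a side note: the fixed kernel assumes a block of exactly 1024 threads, so that all 32 shmem slots get written before warp 0 reads them. If you wanted to launch it with a smaller block size, a guard along these lines (a sketch, not part of the code above) would keep warp 0 from reading uninitialized shared memory by substituting the identity value for a min reduction:

// replace the unconditional read of shmem[lane_id] with:
uint32_t num_warps = (blockDim.x + warpSize - 1) / warpSize;  // warps that actually wrote shmem
val = (lane_id < num_warps) ? shmem[lane_id] : 0xFFFFFFFFU;   // pad missing warps with the min identity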
answered on Stack Overflow Jul 24, 2019 by Robert Crovella • edited Jul 24, 2019 by Robert Crovella
