cuda select k unique values matching predicate

-1

I have an array A of N values that can contain duplicates. Some of the elements of A may be marked (highest bit set) denoting that some predicate, relevant to another part of the application, is true for that element. Also the "marking" kernel guarantees that if A[i] == A[j] and A[i] is marked, then A[j] is also marked.

What i am working on is a "select" operation, that selects k <= K marked elements of A in an array B. All k elements of B have to be unique and k needs to be known for subsequent processing. Note that K is predefined and in general much smaller than N. For simplicity assume that K = 32.

My idea is to launch a <<<N / BLOCK_SIZE, BLOCK_SIZE>>> kernel where threads in a block vote for up to k values in shared memory. Something like:

#define NO_VALUE 0xffffffff

__global__ void select(A, K) {

  __shared__ int k;
  __shared__ bool done;
  __shared__ uint32_t votes[K];

  int tid = blockIdx.x * blockDim.x + threadIdx.x;  

  votes[threadIdx.x] = NO_VALUE;

  if ( threadIdx.x == 0) {
    k = 0;
    done = false;
  }

  __syncthreads();

  bool voted = false;
  int i = threadIdx;

  while ( !done) {
    if ( marked(A[tid]) && !voted) {
      voted |= (atomicCAS(block_votes[k + (i % (K - k))], NO_VALUE, A[tid]) == NO_VALUE);
      ++i;  
    }

    __syncthreads();

    if ( threadIdx.x == 0) {
      int l = remove_duplicates(votes); 
      done = l != k;
      k = l;
    }

    __syncthreads();
  }  
}

The idea is that a thread writes to index i % k only if votes[i % k] == NO_VALUE. Then thread 0 removes duplicates and checks if more values are present (different k). Since the assumption is that K is small (32) this should be fine (however we can definitely optimize here). In the next iteration only threads that haven't voted vote, starting at the first NO_VALUE index k + i (mod the number of NO_VALUE elements, K - k). This procedure should produce up to k unique values within a block (code not tested but should give you the idea).

Now what i am looking for is a way to extend this globally. Such that in the end the array B is populated. I could for instance have each block write one of its values at B[blockIdx.x % k] and then remove the duplicates. This is fine in terms of correctness but i could potentially end up with a less values. Another approach would be to do it iteratively in the same manner as the kernel above. Only 1 thread per block does work. All blocks vote their i-th value in each iteration (if one exists). Then thread 0 in block 0 removes duplicates and we continue with the remaining indices etc. However this would require many global memory accesses and i am looking for something better. Any suggestions are appreciated.

EDIT: I am aware that i can use thrust::sort -> thrust::unique or thrust::sort -> thrust::unique_copy and then select the first K elements. However i do not want to modify A as the index-value pairs matter further down the line (case against thrust::unique), and i would really prefer to B to have length K instead of N (case against thrust::unique_copy).

cuda
filtering
unique
asked on Stack Overflow Jul 17, 2019 by qbit • edited Jul 18, 2019 by qbit

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0