I have an array A of N values that can contain duplicates. Some of the elements of A may be marked (highest bit set) denoting that some predicate, relevant to another part of the application, is true for that element. Also the "marking" kernel guarantees that if A[i] == A[j] and A[i] is marked, then A[j] is also marked.
What i am working on is a "select" operation, that selects k <= K marked elements of A in an array B. All k elements of B have to be unique and k needs to be known for subsequent processing. Note that K is predefined and in general much smaller than N. For simplicity assume that K = 32.
My idea is to launch a <<<N / BLOCK_SIZE, BLOCK_SIZE>>> kernel where threads in a block vote for up to k values in shared memory. Something like:
#define NO_VALUE 0xffffffff
__global__ void select(A, K) {
__shared__ int k;
__shared__ bool done;
__shared__ uint32_t votes[K];
int tid = blockIdx.x * blockDim.x + threadIdx.x;
votes[threadIdx.x] = NO_VALUE;
if ( threadIdx.x == 0) {
k = 0;
done = false;
}
__syncthreads();
bool voted = false;
int i = threadIdx;
while ( !done) {
if ( marked(A[tid]) && !voted) {
voted |= (atomicCAS(block_votes[k + (i % (K - k))], NO_VALUE, A[tid]) == NO_VALUE);
++i;
}
__syncthreads();
if ( threadIdx.x == 0) {
int l = remove_duplicates(votes);
done = l != k;
k = l;
}
__syncthreads();
}
}
The idea is that a thread writes to index i % k only if votes[i % k] == NO_VALUE. Then thread 0 removes duplicates and checks if more values are present (different k). Since the assumption is that K is small (32) this should be fine (however we can definitely optimize here). In the next iteration only threads that haven't voted vote, starting at the first NO_VALUE index k + i (mod the number of NO_VALUE elements, K - k). This procedure should produce up to k unique values within a block (code not tested but should give you the idea).
Now what i am looking for is a way to extend this globally. Such that in the end the array B is populated. I could for instance have each block write one of its values at B[blockIdx.x % k] and then remove the duplicates. This is fine in terms of correctness but i could potentially end up with a less values. Another approach would be to do it iteratively in the same manner as the kernel above. Only 1 thread per block does work. All blocks vote their i-th value in each iteration (if one exists). Then thread 0 in block 0 removes duplicates and we continue with the remaining indices etc. However this would require many global memory accesses and i am looking for something better. Any suggestions are appreciated.
EDIT: I am aware that i can use thrust::sort -> thrust::unique or thrust::sort -> thrust::unique_copy and then select the first K elements. However i do not want to modify A as the index-value pairs matter further down the line (case against thrust::unique), and i would really prefer to B to have length K instead of N (case against thrust::unique_copy).
User contributions licensed under CC BY-SA 3.0