I have an array A
of N
values that can contain duplicates. Some of the elements of A
may be marked (highest bit set) denoting that some predicate, relevant to another part of the application, is true for that element. Also the "marking" kernel guarantees that if A[i] == A[j]
and A[i]
is marked, then A[j]
is also marked.
What i am working on is a "select" operation, that selects k <= K
marked elements of A
in an array B
. All k
elements of B
have to be unique and k
needs to be known for subsequent processing. Note that K
is predefined and in general much smaller than N
. For simplicity assume that K = 32
.
My idea is to launch a <<<N / BLOCK_SIZE, BLOCK_SIZE>>>
kernel where threads in a block vote for up to k
values in shared memory. Something like:
#define NO_VALUE 0xffffffff
__global__ void select(A, K) {
__shared__ int k;
__shared__ bool done;
__shared__ uint32_t votes[K];
int tid = blockIdx.x * blockDim.x + threadIdx.x;
votes[threadIdx.x] = NO_VALUE;
if ( threadIdx.x == 0) {
k = 0;
done = false;
}
__syncthreads();
bool voted = false;
int i = threadIdx;
while ( !done) {
if ( marked(A[tid]) && !voted) {
voted |= (atomicCAS(block_votes[k + (i % (K - k))], NO_VALUE, A[tid]) == NO_VALUE);
++i;
}
__syncthreads();
if ( threadIdx.x == 0) {
int l = remove_duplicates(votes);
done = l != k;
k = l;
}
__syncthreads();
}
}
The idea is that a thread writes to index i % k
only if votes[i % k] == NO_VALUE
. Then thread 0
removes duplicates and checks if more values are present (different k
). Since the assumption is that K
is small (32) this should be fine (however we can definitely optimize here). In the next iteration only threads that haven't voted vote, starting at the first NO_VALUE
index k + i
(mod the number of NO_VALUE
elements, K - k
). This procedure should produce up to k
unique values within a block (code not tested but should give you the idea).
Now what i am looking for is a way to extend this globally. Such that in the end the array B
is populated. I could for instance have each block write one of its values at B[blockIdx.x % k]
and then remove the duplicates. This is fine in terms of correctness but i could potentially end up with a less values. Another approach would be to do it iteratively in the same manner as the kernel above. Only 1 thread per block does work. All blocks vote their i
-th value in each iteration (if one exists). Then thread 0
in block 0
removes duplicates and we continue with the remaining indices etc. However this would require many global memory accesses and i am looking for something better. Any suggestions are appreciated.
EDIT: I am aware that i can use thrust::sort
-> thrust::unique
or thrust::sort
-> thrust::unique_copy
and then select the first K
elements. However i do not want to modify A
as the index-value pairs matter further down the line (case against thrust::unique
), and i would really prefer to B
to have length K
instead of N
(case against thrust::unique_copy
).
User contributions licensed under CC BY-SA 3.0