Count integers in an array where the set bits are a subset of a given mask


Given a mask and a value, the mask covers the value if every bit set in the value is also set in the mask.

For example:

mask:  0b011010
value: 0b010010

true

or

mask:  0b011010
value: 0b010110

false
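
In other words, each of the following equivalent C++ expressions tests whether mask covers value (the parentheses matter, since == binds tighter than | and &):

bool covered1 = (value | mask) == mask;   // OR-ing the value into the mask changes nothing
bool covered2 = (value & mask) == value;  // AND-ing with the mask removes no bits
bool covered3 = (value & ~mask) == 0;     // the value has no bits outside the mask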

For int arr[arr_size], I need to calculate how many elements of the array are covered by the given mask.

My code:

int count = 0;

for (int index = 0; index < arr_size; index++)
{
    // the parentheses are required: == binds tighter than |, so the
    // unparenthesized arr[index] | mask == mask would always be true
    if ((arr[index] | mask) == mask)
        count++;
}

or (slower)

int count = 0;

for (int index = 0; index < arr_size; index++)
{
    // again the parentheses are required: without them this parses as
    // arr[index] & (mask == arr[index]), which is wrong but not always true
    if ((arr[index] & mask) == arr[index])
        count++;
}

My program needs to perform this calculation very often.

Can you tell me if there is any way to speed up this calculation, for example using SSE or AVX instructions?


P.S.

My code is turned into 5 instructions by the compiler (with optimizations enabled), but maybe SIMD (vector) instructions would give an additional speed gain.


P.P.S. minimal code:

constexpr int block_size = 16;
int arr[block_size] = { rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand() }; // random values for example
int mask = rand(); 

int count = 0;

for (__int64 cycles = 0; cycles < 0xFFFFFFFF; cycles ++)
{
    for (int index = 0; index < block_size; index ++)
    {
        if ((arr[index] | mask) == mask)
            count++;
    }
}

Tags: c++, optimization, sse, avx, bitmask
asked on Stack Overflow Jan 31, 2021 by Zhihar • edited Feb 1, 2021 by Peter Cordes

1 Answer


AVX1 is useless for this: it only does floating-point stuff, and you have integers there. Here’s an AVX2 port of your code.

#include <immintrin.h>
#include <assert.h>
#include <limits.h> // INT_MAX

// Compute horizontal sum of all int32 lanes of the vector
inline int horizontalSum( const __m256i v )
{
    // Add 8 scalars into 4, by extracting lower/higher pieces from the source vector
    const __m128i r4 = _mm_add_epi32( _mm256_castsi256_si128( v ),
        _mm256_extracti128_si256( v, 1 ) );
    // Add 4 scalars into 2, using shift by 8 bytes
    const __m128i r2 = _mm_add_epi32( r4, _mm_srli_si128( r4, 8 ) );
    // Add 2 lowest scalars, using shift by 4 bytes
    const __m128i r1 = _mm_add_epi32( r2, _mm_srli_si128( r2, 4 ) );
    // Move the lowest int32 into scalar register
    return _mm_cvtsi128_si32( r1 );
}

int countMaskedAvx2( const int* src, size_t length, int mask )
{
    assert( length <= INT_MAX );

    const int* const srcEndAligned = src + ( length / 8 ) * 8;
    const int* const srcEnd = src + length;
    // Broadcast mask into all 8 lanes of a vector register
    const __m256i maskVector = _mm256_set1_epi32( mask );
    // Initialize counters with zero vector
    __m256i counters = _mm256_setzero_si256();

    // Handle vast majority of data with AVX2 code
    while( src < srcEndAligned )
    {
        // Load 8 scalars from the array
        __m256i v = _mm256_loadu_si256( ( const __m256i * )src );
        src += 8;
        // val | mask
        v = _mm256_or_si256( v, maskVector );
        // ( val | mask ) == mask
        v = _mm256_cmpeq_epi32( v, maskVector );
        // Increment the counters.
        // The result of that comparison instruction is 0 for false, or 0xFFFFFFFF for true.
    // Conveniently, the 0xFFFFFFFF bit pattern equals -1 as a signed integer.
        counters = _mm256_sub_epi32( counters, v );
    }

    // Compute horizontal sum of the counters vector
    int counter = horizontalSum( counters );

    // Handle the remaining 0-7 scalars of the source array
    while( src < srcEnd )
    {
        const int v = mask | ( *src );
        src++;
        const int add = ( v == mask ) ? 1 : 0;
        counter += add;
    }
    return counter;
}
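
A minimal usage sketch (the driver below is my illustration, not part of the answer; it assumes the file is built with AVX2 enabled, e.g. -mavx2 or /arch:AVX2):

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int arr[1000];
    for ( int i = 0; i < 1000; i++ )
        arr[i] = rand();
    const int mask = rand();
    // Count the elements of arr covered by mask
    printf( "%d\n", countMaskedAvx2( arr, 1000, mask ) );
    return 0;
}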

If you don’t have AVX2, the same approach works with SSE2. You can rely on SSE2 being available: all 64-bit x86 processors are required to support at least SSE1 and SSE2. The SSE2 version will run twice as many instructions as the AVX2 one, but it will still use 128-bit memory loads and 4-wide arithmetic instructions. Even with SSE2, I would expect a measurable performance improvement over your original code.
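
Here is a sketch of what that SSE2 version could look like (countMaskedSse2 is my name for it, not from the answer; it mirrors the structure of the AVX2 code above and assumes the same assert.h / limits.h includes):

#include <emmintrin.h> // SSE2 intrinsics

int countMaskedSse2( const int* src, size_t length, int mask )
{
    assert( length <= INT_MAX );

    const int* const srcEndAligned = src + ( length / 4 ) * 4;
    const int* const srcEnd = src + length;
    // Broadcast the mask into all 4 lanes of a vector register
    const __m128i maskVector = _mm_set1_epi32( mask );
    __m128i counters = _mm_setzero_si128();

    while( src < srcEndAligned )
    {
        // Load 4 scalars from the array
        __m128i v = _mm_loadu_si128( ( const __m128i * )src );
        src += 4;
        // ( val | mask ) == mask, per lane: 0xFFFFFFFF when covered, 0 otherwise
        v = _mm_or_si128( v, maskVector );
        v = _mm_cmpeq_epi32( v, maskVector );
        // Subtracting -1 increments the matching lanes
        counters = _mm_sub_epi32( counters, v );
    }

    // Horizontal sum of the 4 counter lanes
    const __m128i r2 = _mm_add_epi32( counters, _mm_srli_si128( counters, 8 ) );
    const __m128i r1 = _mm_add_epi32( r2, _mm_srli_si128( r2, 4 ) );
    int counter = _mm_cvtsi128_si32( r1 );

    // Handle the remaining 0-3 scalars
    while( src < srcEnd )
    {
        if( ( *src | mask ) == mask )
            counter++;
        src++;
    }
    return counter;
}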

Update: for optimal performance of the above code, make sure your input array is aligned by 32 bytes (AVX2) or 16 bytes (SSE2). If you have C++11 or newer, alignas(32) is often good enough.
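
For example, a minimal sketch of aligned storage with C++11 (the buffer name and size are illustrative):

alignas(32) int arr[1024]; // 32-byte aligned for AVX2; also satisfies SSE2's 16-byte requirement

For heap allocations, _mm_malloc( size, 32 ) or C++17 std::aligned_alloc are common options.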

answered on Stack Overflow Feb 1, 2021 by Soonts • edited Feb 1, 2021 by Soonts

User contributions licensed under CC BY-SA 3.0