Given a mask and a value, the mask *covers* the value if every bit set in the value is also set in the mask.

For example:

```
mask: 0b011010
value: 0b010010
true
```

or

```
mask: 0b011010
value: 0b010110
false
```
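In C++, the coverage test can be written as a small helper (a sketch; note the parentheses, since `==` binds tighter than `|`):

```
#include <cassert>

// A value is covered when OR-ing it into the mask changes nothing,
// or equivalently when the value has no bits set outside the mask.
inline bool covers( int mask, int value )
{
    return ( value | mask ) == mask;   // same as ( value & ~mask ) == 0
}
```

For the examples above, `covers( 0b011010, 0b010010 )` is true and `covers( 0b011010, 0b010110 )` is false.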

For `int arr[arr_size]`, I need to calculate how many elements of the array are covered by the given mask.

My code:

```
int count = 0;
for (int index = 0; index < arr_size; index++)
{
// parentheses needed: == binds tighter than |
if ((arr[index] | mask) == mask)
count++;
}
```

or (slower)

```
int count = 0;
for (int index = 0; index < arr_size; index++)
{
// equivalent check: the value has no bits outside the mask
if ((arr[index] & mask) == arr[index])
count++;
}
```

My program very often needs to calculate the number of such array elements.

Can you tell me if there is any way to speed up these calculations, for example with SSE or AVX instructions?

P.S.

The compiler turns my loop body into 5 instructions (with optimizations enabled), but maybe vector (SIMD) instructions would give an additional speed gain.

P.P.S. minimal code:

```
constexpr int block_size = 16;
int arr[] = {rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), }; // random values for example
int mask = rand();
int count = 0;
for (__int64 cycles = 0; cycles < 0xFFFFFFFF; cycles ++)
{
for (int index = 0; index < block_size; index ++)
{
if ((arr[index] | mask) == mask)
count++;
}
}
```

AVX1 is useless for this: it only adds floating-point operations, and you have integers here. Here's an AVX2 port of your code.

```
#include <immintrin.h>
#include <assert.h>
#include <limits.h> // INT_MAX
// Compute horizontal sum of all int32 lanes of the vector
inline int horizontalSum( const __m256i v )
{
// Add 8 scalars into 4, by extracting lower/higher pieces from the source vector
const __m128i r4 = _mm_add_epi32( _mm256_castsi256_si128( v ),
_mm256_extracti128_si256( v, 1 ) );
// Add 4 scalars into 2, using shift by 8 bytes
const __m128i r2 = _mm_add_epi32( r4, _mm_srli_si128( r4, 8 ) );
// Add 2 lowest scalars, using shift by 4 bytes
const __m128i r1 = _mm_add_epi32( r2, _mm_srli_si128( r2, 4 ) );
// Move the lowest int32 into scalar register
return _mm_cvtsi128_si32( r1 );
}
int countMaskedAvx2( const int* src, size_t length, int mask )
{
assert( length <= INT_MAX );
const int* const srcEndAligned = src + ( length / 8 ) * 8;
const int* const srcEnd = src + length;
// Broadcast mask into all 8 lanes of a vector register
const __m256i maskVector = _mm256_set1_epi32( mask );
// Initialize counters with zero vector
__m256i counters = _mm256_setzero_si256();
// Handle vast majority of data with AVX2 code
while( src < srcEndAligned )
{
// Load 8 scalars from the array
__m256i v = _mm256_loadu_si256( ( const __m256i * )src );
src += 8;
// val | mask
v = _mm256_or_si256( v, maskVector );
// ( val | mask ) == mask
v = _mm256_cmpeq_epi32( v, maskVector );
// Increment the counters.
// The result of that comparison instruction is 0 for false, or 0xFFFFFFFF for true.
// Conveniently, 0xFFFFFFFF bit pattern equals to -1 signed integer.
counters = _mm256_sub_epi32( counters, v );
}
// Compute horizontal sum of the counters vector
int counter = horizontalSum( counters );
// Handle the final 1-7 scalars of the source array
while( src < srcEnd )
{
const int v = mask | ( *src );
src++;
const int add = ( v == mask ) ? 1 : 0;
counter += add;
}
return counter;
}
```

If you don’t have AVX2, the same approach is doable with SSE2. You can rely on SSE2 being available: all 64-bit x86 processors are required to support at least SSE1 and SSE2. The SSE2 version will run twice as many instructions as AVX2, but it will still use 128-bit memory loads and 4-wide arithmetic instructions. Even with SSE2, I would expect a measurable performance improvement over your original code.
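Here's a sketch of that SSE2 variant, with the same contract as the AVX2 function (the name `countMaskedSse2` is mine):

```
#include <emmintrin.h>
#include <stddef.h>

// Count elements where ( src[i] | mask ) == mask, using 4-wide SSE2.
int countMaskedSse2( const int* src, size_t length, int mask )
{
    const int* const srcEndAligned = src + ( length / 4 ) * 4;
    const int* const srcEnd = src + length;
    // Broadcast mask into all 4 lanes of a vector register
    const __m128i maskVector = _mm_set1_epi32( mask );
    __m128i counters = _mm_setzero_si128();
    while( src < srcEndAligned )
    {
        // Load 4 scalars from the array
        __m128i v = _mm_loadu_si128( ( const __m128i * )src );
        src += 4;
        // ( val | mask ) == mask
        v = _mm_or_si128( v, maskVector );
        v = _mm_cmpeq_epi32( v, maskVector );
        // cmpeq yields -1 per matching lane; subtracting adds 1 to that counter
        counters = _mm_sub_epi32( counters, v );
    }
    // Horizontal sum of the 4 lanes, same shift-and-add trick as the AVX2 version
    counters = _mm_add_epi32( counters, _mm_srli_si128( counters, 8 ) );
    counters = _mm_add_epi32( counters, _mm_srli_si128( counters, 4 ) );
    int counter = _mm_cvtsi128_si32( counters );
    // Handle the final 1-3 scalars
    while( src < srcEnd )
    {
        if( ( *src | mask ) == mask )
            counter++;
        src++;
    }
    return counter;
}
```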

**Update:** for optimal performance of the above code, make sure your input array is aligned by 32 bytes (AVX2) or 16 bytes (SSE2). If you have C++11 or newer, `alignas(32)` is often good enough.
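For example, a 32-byte-aligned buffer could be declared like this (a sketch; `alignas` requires C++11):

```
#include <cstdint>

// Align the array storage to 32 bytes so AVX2 loads never straddle a cache line.
alignas( 32 ) static int arr[1024];
```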
