I have an array called A that contains 32 unsigned char values.
I want to unpack these values into 4 __m256 variables according to the following rule: assuming an index from 0 to 31 over all the values of A, the 4 unpacked variables would hold these values:
B_0 = A[0], A[4], A[8], A[12], A[16], A[20], A[24], A[28]
B_1 = A[1], A[5], A[9], A[13], A[17], A[21], A[25], A[29]
B_2 = A[2], A[6], A[10], A[14], A[18], A[22], A[26], A[30]
B_3 = A[3], A[7], A[11], A[15], A[19], A[23], A[27], A[31]
To do that, I have this code:
const auto mask = _mm256_set1_epi32( 0x000000FF );
...
const auto A_values = _mm256_i32gather_epi32(reinterpret_cast<const int*>(A.data()), A_positions.values_, 4);
// The code below is equivalent to B_0 = static_cast<float>((A_values >> 24) & 0x000000FF)
const auto B_0 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 24), mask));
const auto B_1 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 16), mask));
const auto B_2 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 8), mask));
const auto B_3 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 0), mask));
This works great, but I wonder if there is some faster way to do this, especially regarding the shift-right and AND operations that I use to extract the values.
Also, just for clarification: I said that array A was of size 32, but that's not true. The array actually contains many more values, and I need to access its elements from different positions (but always in blocks of 4 uint8_t); that's why I use _mm256_i32gather_epi32 to retrieve them. I only restricted the array size in this example for simplicity.
The shift/mask can be combined into a vpshufb. Of course that means there are shuffle masks to worry about, which have to come from somewhere. If they can stay in registers it's no big deal; if they have to be loaded, that may kill this technique.
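For concreteness, here is a minimal sketch of what that could look like, assuming the same A_values as in the question (shuf_b0 is a hypothetical control constant; built from compile-time arguments, it becomes exactly the kind of in-memory shuffle mask mentioned above):

const __m256i shuf_b0 = _mm256_setr_epi8(
    3, -1, -1, -1, 7, -1, -1, -1, 11, -1, -1, -1, 15, -1, -1, -1,
    3, -1, -1, -1, 7, -1, -1, -1, 11, -1, -1, -1, 15, -1, -1, -1);
// vpshufb zeroes every destination byte whose control byte is -1 (top bit set),
// so byte 3 of each dword lands zero-extended in the low byte of that dword,
// matching B_0 = (A_values >> 24) & 0xFF. vpshufb only shuffles within each
// 128-bit lane, which is fine here because no byte crosses a dword boundary.
const __m256 B_0 = _mm256_cvtepi32_ps(_mm256_shuffle_epi8(A_values, shuf_b0));
// B_1, B_2 and B_3 use source byte offsets 2, 1 and 0 in each dword.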
This seems dubious as an optimization on Intel, since the shift has a reciprocal throughput of 0.5 and the AND 0.33, which is better than the 1 you'd get with a shuffle (Intel processors with two shuffle units did not support AVX2, so they are not relevant; the shuffle goes to P5). It's still fewer µops, so in the context of other code it may or may not be worth doing, depending on where the bottleneck is. If the rest of the code only uses P01 (typical for FP SIMD), moving µops to P5 is probably a good idea.
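To make that trade-off concrete, here is a rough µop count using the throughput numbers above, ignoring the gather and the four cvtdq2ps that both versions need either way:

// shift+mask: 4 × vpsrad (P01) + 4 × vpand (P015) = 8 µops
// vpshufb:    4 × vpshufb (P5)                    = 4 µops
// Half the µops, but all of them now compete for the single shuffle port.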
On Ryzen it is generally better, since vector shifts have low throughput there. A 256-bit vpsrad generates 2 µops that both have to go to port 2 (and then there are two more µops for the vpand, but those can go to any of the four ALU ports), while a 256-bit vpshufb generates 2 µops that can go to ports 1 and 2. On the other hand, gather is so bad on Ryzen that all of this is just noise compared to the huge flood of µops it produces. You could gather manually, but that is still a lot of µops, and they will likely go to P12, which makes this technique bad.
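For reference, a manual gather could look like the sketch below; pos is a hypothetical array holding the same 8 dword indices as A_positions.values_, each addressing a 4-byte block to match the scale of 4 from the question:

#include <cstring>
#include <immintrin.h>

// Eight scalar loads stitched into one vector; the compiler lowers this to
// vmovd/vpinsrd plus a lane insert, which is still a lot of µops, as noted above.
static inline __m256i gather_dwords(const unsigned char* base, const int* pos)
{
    auto load32 = [&](int i) {
        int v;
        std::memcpy(&v, base + 4 * pos[i], sizeof v); // 4-byte block at scaled index
        return v;
    };
    return _mm256_setr_epi32(load32(0), load32(1), load32(2), load32(3),
                             load32(4), load32(5), load32(6), load32(7));
}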
In conclusion, I can't tell you whether this is actually faster or not; it depends.