The _loaddqu_LE intrinsic stores bytes in reverse order. Please suggest a workaround, or a way to rearrange the bytes in an array before using _loaddqu_LE.
#include <stdio.h>
#include <stdint.h>
int main() {
uint32_t src[16];
__m128i a; /* 128 bit */
src[0] = 0x00000000;
src[1] = 0x00000000;
src[2] = 0x00000000;
src[3] = 0x00000000;
src[4] = 0x63636362;
src[5] = 0x63636362;
src[6] = 0x63636362;
src[7] = 0x63636362;
src[8] = 0xc998989b;
src[9] = 0xaafbfbf9;
src[10] =0xc998989b;
src[11] =0xaafbfbf9;
src[12] =0x50349790;
src[13] =0xfacf6c69;
src[14] =0x3357f4f2;
src[15] =0x99ac0f0b;
/* load 128 bits (four 32-bit words) at a time */
a = _loaddqu_LE((const char _ptr64 *) & (((__m128i *)src)[0]));
printf("0x%016llx%016llx\n", a.v0, a.v1);
a = _loaddqu_LE((const char _ptr64 *) & (((__m128i *)src)[1]));
printf("0x%016llx%016llx\n", a.v0, a.v1);
return 0;
}
Actual output:
0x00000000000000000000000000000000 0x62636363626363636263636362636363
Expected output:
0x00000000000000000000000000000000 0x63636362636363626363636263636362
Let's say you have a 128-bit unsigned integer
28018020645823955151501786048551321856
In hexadecimal, it is
0x15141312111009080706050403020100
On architectures that use little-endian byte order, like 64-bit Intel/AMD (which is the most likely candidate, considering the __m128i type used), the bytes of that number are stored in memory, in hexadecimal, as
0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x10 0x11 0x12 0x13 0x14 0x15
We can reinterpret these bytes for example as eight 16-bit unsigned integers,
0x0100 0x0302 0x0504 0x0706 0x0908 0x1110 0x1312 0x1514
or four 32-bit unsigned integers,
0x03020100 0x07060504 0x11100908 0x15141312
or two 64-bit unsigned integers,
0x0706050403020100 0x1514131211100908
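To make the reinterpretation above concrete, here is a minimal sketch in portable C. The helper name load_le and the array name example_bytes are made up for this illustration; the function assembles a value from bytes stored least significant byte first, so it gives the same result regardless of the byte order of the machine it runs on.

```c
#include <stdint.h>

/* Hypothetical helper: assemble an unsigned value from n bytes (n <= 8)
   that are stored least significant byte first. */
static uint64_t load_le(const uint8_t *p, int n)
{
    uint64_t value = 0;
    for (int i = n - 1; i >= 0; i--)
        value = (value << 8) | p[i];
    return value;
}

/* The sixteen example bytes, in memory order. */
static const uint8_t example_bytes[16] = {
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15
};
```

For example, load_le(example_bytes, 2) is 0x0100, load_le(example_bytes + 12, 4) is 0x15141312, and load_le(example_bytes + 8, 8) is 0x1514131211100908, matching the lists above.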
OP wishes to split the 128-bit unsigned integer input into two 64-bit unsigned integers. Intel/AMD compilers provide the _mm_shuffle_epi8() and _mm_set_epi8() intrinsics for this. (If OP is using TNS/X C/C++, the equivalent intrinsics are _pshufb() and _mm_set_epi8().)
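The _mm_set_epi8() intrinsic takes 16 parameters, most significant byte first, and packs them into a 128-bit integer. The _mm_shuffle_epi8()/_pshufb() intrinsics take two 128-bit integers as parameters and return a 128-bit integer constructed from the bytes of the first parameter, as directed by the bytes of the second parameter: byte i of the result is the byte of the first parameter whose index is given by the low four bits of byte i of the second parameter (or zero, if the high bit of that byte is set).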
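As a rough illustration of what the shuffle does, here is a scalar model in plain C. The function name shuffle_bytes is made up for this sketch; it follows the documented behaviour of _mm_shuffle_epi8() (the PSHUFB instruction), and I am assuming _pshufb() on TNS/X behaves the same way. Arrays are in memory order, so index 0 is the least significant byte.

```c
#include <stdint.h>

/* Scalar model of a 16-byte shuffle (hypothetical helper, not an intrinsic):
   dst[i] = src[mask[i] & 15], or zero when the high bit of mask[i] is set. */
static void shuffle_bytes(uint8_t dst[16], const uint8_t src[16],
                          const uint8_t mask[16])
{
    uint8_t tmp[16];   /* temporary, so that dst may alias src */
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}
```

With mask[i] = 15 - i the result is the source with all sixteen bytes reversed.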
Here are some useful byte order constants:
/* SWAP128_128 = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15); */
#define SWAP128_128 { 579005069656919567LL, 283686952306183LL }
/* SWAP128_64 = _mm_set_epi8(8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7); */
#define SWAP128_64 { 283686952306183LL, 579005069656919567LL }
/* SWAP128_32 = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3); */
#define SWAP128_32 { 289644378169868803LL, 868365760874482187LL }
/* SWAP128_16 = _mm_set_epi8(14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1); */
#define SWAP128_16 { 434320308619640833LL, 1013041691324254217LL }
const __m128i swap128_128 = SWAP128_128;
const __m128i swap128_64 = SWAP128_64;
const __m128i swap128_32 = SWAP128_32;
const __m128i swap128_16 = SWAP128_16;
Note that the constant declarations assume that the C compiler implements the __m128i type as if it were two long longs, low half first (as far as I know, all that support those intrinsics for SSSE3 do). In any case, you can construct the constants using the _mm_set_epi8() intrinsic.
The reason for putting them in macros is that if you encounter a compiler or architecture that requires a different type of declaration to get the same effective value (as the respective _mm_set_epi8() intrinsic yields), you only need a small bit of preprocessor massaging.
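If you want to check or regenerate the decimal constants, the following sketch computes the two 64-bit halves from the sixteen _mm_set_epi8() arguments. The helper names pack_msb_first and set_epi8_halves are made up for this illustration, and it assumes the low half is listed first in the initializer, as in the macros above.

```c
#include <stdint.h>

/* Pack eight bytes, given most significant first, into one 64-bit value. */
static uint64_t pack_msb_first(const uint8_t b[8])
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++)
        value = (value << 8) | b[i];
    return value;
}

/* Split sixteen _mm_set_epi8()-style arguments (most significant byte first)
   into the low and high 64-bit halves of the 128-bit value. */
static void set_epi8_halves(const uint8_t args[16],
                            uint64_t *low, uint64_t *high)
{
    *high = pack_msb_first(args);      /* first eight arguments: bytes 15..8 */
    *low  = pack_msb_first(args + 8);  /* last eight arguments:  bytes 7..0  */
}
```

For the arguments 0, 1, 2, ..., 15 this gives low = 579005069656919567 (0x08090A0B0C0D0E0F) and high = 283686952306183 (0x0001020304050607), which are the two values in SWAP128_128.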
Using the above, a = _mm_shuffle_epi8(a, swap128_128); (or a = _pshufb(a, swap128_128); for TNS/X C/C++) reverses the entire byte order; swap128_64 reverses the byte order within each of the two 64-bit components, swap128_32 within each of the four 32-bit components, and swap128_16 within each of the eight 16-bit components. There are eleven other variations (plus "no shuffle", for a total of 16 byte orders for 128-bit values), and you can also duplicate source bytes into several target bytes, so do use _mm_set_epi8() to construct the one you need.
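One way to enumerate those byte orders, assuming they are the combinations of swapping neighbours at 8-bit, 16-bit, 32-bit, and 64-bit granularity (that is my reading of the count of 16), is to XOR each byte index with a constant k between 0 and 15. The function name make_byte_order_mask is made up for this sketch, and the mask is produced in memory order (index 0 is the least significant byte, i.e. the last argument of _mm_set_epi8()).

```c
#include <stdint.h>

/* Build a 16-byte shuffle mask in memory order: mask[i] = i ^ k.
   Bit 0 of k swaps adjacent bytes, bit 1 adjacent 16-bit units,
   bit 2 adjacent 32-bit units, and bit 3 the two 64-bit halves. */
static void make_byte_order_mask(uint8_t mask[16], unsigned k)
{
    for (unsigned i = 0; i < 16; i++)
        mask[i] = (uint8_t)(i ^ (k & 15u));
}
```

With this numbering, k = 15 corresponds to swap128_128, k = 7 to swap128_64, k = 3 to swap128_32, k = 1 to swap128_16, and k = 0 to "no shuffle".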
Given the above data,
const uint8_t data[16] = {
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
0x08, 0x09, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15
};
__m128i vector = _mm_lddqu_si128((const __m128i *)data);
__m128i v128 = _mm_shuffle_epi8(vector, swap128_128);
__m128i v64 = _mm_shuffle_epi8(vector, swap128_64);
__m128i v32 = _mm_shuffle_epi8(vector, swap128_32);
__m128i v16 = _mm_shuffle_epi8(vector, swap128_16);
will yield:
vector = 0x0706050403020100 0x1514131211100908
= 0x03020100 0x07060504 0x11100908 0x15141312
= 0x0100 0x0302 0x0504 0x0706 0x0908 0x1110 0x1312 0x1514
= 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x10 0x11 0x12 0x13 0x14 0x15
v128 = 0x0809101112131415 0x0001020304050607
= 0x12131415 0x08091011 0x04050607 0x00010203
= 0x1415 0x1213 0x1011 0x0809 0x0607 0x0405 0x0203 0x0001
= 0x15 0x14 0x13 0x12 0x11 0x10 0x09 0x08 0x07 0x06 0x05 0x04 0x03 0x02 0x01 0x00
v64 = 0x0001020304050607 0x0809101112131415
= 0x04050607 0x00010203 0x12131415 0x08091011
= 0x0607 0x0405 0x0203 0x0001 0x1415 0x1213 0x1011 0x0809
= 0x07 0x06 0x05 0x04 0x03 0x02 0x01 0x00 0x15 0x14 0x13 0x12 0x11 0x10 0x09 0x08
v32 = 0x0405060700010203 0x1213141508091011
= 0x00010203 0x04050607 0x08091011 0x12131415
= 0x0203 0x0001 0x0607 0x0405 0x1011 0x0809 0x1415 0x1213
= 0x03 0x02 0x01 0x00 0x07 0x06 0x05 0x04 0x11 0x10 0x09 0x08 0x15 0x14 0x13 0x12
v16 = 0x0607040502030001 0x1415121310110809
= 0x02030001 0x06070405 0x10110809 0x14151213
= 0x0001 0x0203 0x0405 0x0607 0x0809 0x1011 0x1213 0x1415
= 0x01 0x00 0x03 0x02 0x05 0x04 0x07 0x06 0x09 0x08 0x11 0x10 0x13 0x12 0x15 0x14
depending on how you wish to interpret each __m128i. (The first line is as two 64-bit integers, the second as four 32-bit integers, the third as eight 16-bit integers, and the fourth as sixteen bytes.)
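If the goal is to reverse the bytes within each 32-bit word (which is what the expected output in the question seems to suggest, although that is an assumption on my part), a minimal sketch could look like the following. The function name swap_bytes_in_each_u32 is made up; the mask is built with _mm_set_epi8() rather than the macros, and a plain C fallback is used when the compiler does not report SSSE3 support. On TNS/X C/C++ you would presumably use _pshufb() in place of _mm_shuffle_epi8().

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* _mm_shuffle_epi8(); also pulls in the SSE2 intrinsics */
#endif

/* Reverse the byte order within each of the four 32-bit words, in place. */
static void swap_bytes_in_each_u32(uint8_t bytes[16])
{
#if defined(__SSSE3__)
    /* Same byte order as SWAP128_32. */
    const __m128i mask = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11,
                                      4, 5, 6, 7, 0, 1, 2, 3);
    __m128i v = _mm_loadu_si128((const __m128i *)bytes);
    v = _mm_shuffle_epi8(v, mask);
    _mm_storeu_si128((__m128i *)bytes, v);
#else
    /* Portable fallback with the same effect. */
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = bytes[i ^ 3];
    memcpy(bytes, tmp, 16);
#endif
}
```

Applied to the sixteen data bytes above, this produces the byte sequence shown on the last line for v32.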
There are many other possible variations (for 128-bit values, 16 unique byte orders are possible), but without knowing exactly what the underlying problem is and what OP is trying to achieve, I won't bother exploring them all.