_loaddqu_LE intrinsic stores in reverse order

Question

The _loaddqu_LE intrinsic returns the bytes in reverse order. Please suggest a workaround, or should I use an array to rearrange the bytes before calling _loaddqu_LE?

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t src[16];
    __m128i a; /* 128 bit */

    src[0] = 0x00000000;
    src[1] = 0x00000000;
    src[2] = 0x00000000;
    src[3] = 0x00000000;
    src[4] = 0x63636362;
    src[5] = 0x63636362;
    src[6] = 0x63636362;
    src[7] = 0x63636362;
    src[8] = 0xc998989b;
    src[9] = 0xaafbfbf9;
    src[10] = 0xc998989b;
    src[11] = 0xaafbfbf9;
    src[12] = 0x50349790;
    src[13] = 0xfacf6c69;
    src[14] = 0x3357f4f2;
    src[15] = 0x99ac0f0b;

    /* load 128 bits (four 32-bit words) at a time */
    a = _loaddqu_LE((const char _ptr64 *) & (((__m128i *)src)[0]));
    printf("0x%016llx%016llx\n", a.v0, a.v1);
    a = _loaddqu_LE((const char _ptr64 *) & (((__m128i *)src)[1]));
    printf("0x%016llx%016llx\n", a.v0, a.v1);

    return 0;
}

Actual output:

0x00000000000000000000000000000000
0x62636363626363636263636362636363

Expected output:

0x00000000000000000000000000000000
0x63636362636363626363636263636362

Tags: c, x86, byte, bit

asked on Stack Overflow Jan 13, 2019 by arsalan • edited Jan 13, 2019 by Weather Vane

1 Answer

Let's say you have a 128-bit unsigned integer

28018020645823955151501786048551321856

In hexadecimal, it is

0x15141312111009080706050403020100

On architectures that use little-endian byte order, like 64-bit Intel/AMD (which is the most likely candidate, considering the __m128i type used), that number is stored in memory as the following sequence of bytes, least significant byte first:

0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x10 0x11 0x12 0x13 0x14 0x15

We can reinterpret these bytes for example as eight 16-bit unsigned integers,

0x0100 0x0302 0x0504 0x0706 0x0908 0x1110 0x1312 0x1514

or four 32-bit unsigned integers,

0x03020100 0x07060504 0x11100908 0x15141312

or two 64-bit unsigned integers,

0x0706050403020100 0x1514131211100908

OP wishes to split the 128-bit unsigned integer input into two 64-bit unsigned integers. On Intel/AMD, the _mm_shuffle_epi8() intrinsic (which requires SSSE3) and the _mm_set_epi8() intrinsic can be used for this. (If OP is using TNS/X C/C++, the equivalent intrinsics are _pshufb() and _mm_set_epi8().)

The _mm_set_epi8() intrinsic takes 16 parameters, most significant byte first, and packs them into a 128-bit integer. The _mm_shuffle_epi8()/_pshufb() intrinsics take two 128-bit integers as parameters and return a 128-bit integer constructed from the bytes of the first parameter, as directed by the bytes of the second parameter.

Here are some useful byte order constants:

/* SWAP128_128 = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15); */
#define  SWAP128_128  { 579005069656919567LL, 283686952306183LL }

/* SWAP128_64 = _mm_set_epi8(8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7); */
#define  SWAP128_64  { 283686952306183LL, 579005069656919567LL }

/* SWAP128_32 = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3); */
#define  SWAP128_32  { 289644378169868803LL, 868365760874482187LL }

/* SWAP128_16 = _mm_set_epi8(14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1); */
#define  SWAP128_16  { 434320308619640833LL, 1013041691324254217LL }

const __m128i  swap128_128 = SWAP128_128;
const __m128i  swap128_64  = SWAP128_64;
const __m128i  swap128_32  = SWAP128_32;
const __m128i  swap128_16  = SWAP128_16;

Note that these constant declarations assume that the C compiler implements the __m128i type as if it were two long long values (GCC and Clang do; other compilers may not). In any case, you can construct the constants using the _mm_set_epi8() intrinsic.

The reason for defining them as macros is that if you encounter a compiler or architecture that requires a different kind of declaration to get the same effective value (as the respective _mm_set_epi8() intrinsic yields), you only need a little preprocessor massaging.

Using the above, a = _mm_shuffle_epi8(a, swap128_128); (or a = _pshufb(a, swap128_128); for TNS/X C/C++) reverses the byte order of the entire 128-bit value; swap128_64 reverses the byte order within each of the two 64-bit components, swap128_32 within each of the four 32-bit components, and swap128_16 within each of the eight 16-bit components. There are eleven other variations (plus "no shuffle", for a total of 16 byte orders for 128-bit values), and you can also duplicate source bytes into several target bytes, so do use _mm_set_epi8() to construct the one you need.

Given the above data,

const uint8_t  data[16] = {
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15
};
__m128i vector = _mm_lddqu_si128((const __m128i *)data);
__m128i v128 = _mm_shuffle_epi8(vector, swap128_128);
__m128i v64 = _mm_shuffle_epi8(vector, swap128_64);
__m128i v32 = _mm_shuffle_epi8(vector, swap128_32);
__m128i v16 = _mm_shuffle_epi8(vector, swap128_16);

will yield:

vector = 0x0706050403020100 0x1514131211100908
       = 0x03020100 0x07060504 0x11100908 0x15141312
       = 0x0100 0x0302 0x0504 0x0706 0x0908 0x1110 0x1312 0x1514
       = 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x10 0x11 0x12 0x13 0x14 0x15

v128 = 0x0809101112131415 0x0001020304050607
     = 0x12131415 0x08091011 0x04050607 0x00010203
     = 0x1415 0x1213 0x1011 0x0809 0x0607 0x0405 0x0203 0x0001
     = 0x15 0x14 0x13 0x12 0x11 0x10 0x09 0x08 0x07 0x06 0x05 0x04 0x03 0x02 0x01 0x00

v64 = 0x0001020304050607 0x0809101112131415
    = 0x04050607 0x00010203 0x12131415 0x08091011
    = 0x0607 0x0405 0x0203 0x0001 0x1415 0x1213 0x1011 0x0809
    = 0x07 0x06 0x05 0x04 0x03 0x02 0x01 0x00 0x15 0x14 0x13 0x12 0x11 0x10 0x09 0x08

v32 = 0x0405060700010203 0x1213141508091011
    = 0x00010203 0x04050607 0x08091011 0x12131415
    = 0x0203 0x0001 0x0607 0x0405 0x1011 0x0809 0x1415 0x1213
    = 0x03 0x02 0x01 0x00 0x07 0x06 0x05 0x04 0x11 0x10 0x09 0x08 0x15 0x14 0x13 0x12

v16 = 0x0607040502030001 0x1415121310110809
    = 0x02030001 0x06070405 0x10110809 0x14151213
    = 0x0001 0x0203 0x0405 0x0607 0x0809 0x1011 0x1213 0x1415
    = 0x01 0x00 0x03 0x02 0x05 0x04 0x07 0x06 0x09 0x08 0x11 0x10 0x13 0x12 0x15 0x14

depending on how you wish to interpret each __m128i. (In each group, the first line shows the value as two 64-bit integers, the second as four 32-bit integers, the third as eight 16-bit integers, and the fourth as sixteen bytes.)

There are many other possible variations (for 128-bit values, 16 unique byte orders are possible), but without knowing exactly what the underlying problem is and what OP is trying to achieve, I won't bother exploring them all.

answered on Stack Overflow Jan 14, 2019 by Nominal Animal • edited Jan 14, 2019 by Nominal Animal

User contributions licensed under CC BY-SA 3.0