I'm currently working to create a function which accepts two 4 byte unsigned integers, and returns an 8 byte unsigned long. I've tried to base my work off of the methods depicted by this research but all my attempts have been unsuccessful. The specific inputs I am working with are: `0x12345678`

and `0xdeadbeef`

, and the result I'm looking for is `0x12de34ad56be78ef`

. This is my work so far:

```
unsigned long interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
int shift = 33;
for(int i = 64; i > 0; i-=16){
shift -= 8;
//printf("%d\n", i);
//printf("%d\n", shift);
result |= (x & i) << shift;
result |= (y & i) << (shift-1);
}
}
```

However, this function keeps returning `0xfffffffe`

which is incorrect. I am printing and verifying these values using:

```
printf("0x%x\n", z);
```

and the input is initialized like so:

```
uint32_t x = 0x12345678;
uint32_t y = 0xdeadbeef;
```

Any help on this topic would be greatly appreciated, C has been a very difficult language for me, and bitwise operations even more so.

This can be done based on interleaving bits, but skipping some steps so it only interleaves bytes. Same idea: first spread out the bytes in a couple of steps, then combine them.

Here is the plan, illustrated with my amazing freehand drawing skills:

In C (not tested):

```
// step 1, moving the top two bytes
uint64_t a = (((uint64_t)x & 0xFFFF0000) << 16) | (x & 0xFFFF);
// step 2, moving bytes 2 and 6
a = ((a & 0x00FF000000FF0000) << 8) | (a & 0x000000FF000000FF);
// same thing with y
uint64_t b = (((uint64_t)y & 0xFFFF0000) << 16) | (y & 0xFFFF);
b = ((b & 0x00FF000000FF0000) << 8) | (b & 0x000000FF000000FF);
// merge them
uint64_t result = (a << 8) | b;
```

Using SSSE3 PSHUFB has been suggested, it'll work but there is an instruction that can do a byte-wise interleave in one go, punpcklbw. So all we need to really do is get the values into and out of vector registers, and that single instruction will then just care of it.

Not tested:

```
uint64_t interleave(uint32_t x, uint32_t y) {
__m128i xvec = _mm_cvtsi32_si128(x);
__m128i yvec = _mm_cvtsi32_si128(y);
__m128i interleaved = _mm_unpacklo_epi8(yvec, xvec);
return _mm_cvtsi128_si64(interleaved);
}
```

You could do it like this:

```
uint64_t interleave(uint32_t x, uint32_t y)
{
uint64_t z;
unsigned char *a = (unsigned char *)&x; // 1
unsigned char *b = (unsigned char *)&y; // 1
unsigned char *c = (unsigned char *)&z;
c[0] = a[0];
c[1] = b[0];
c[2] = a[1];
c[3] = b[1];
c[4] = a[2];
c[5] = b[2];
c[6] = a[3];
c[7] = b[3];
return z;
}
```

Interchange `a`

and `b`

on the lines marked `1`

depending on ordering requirement.

A version with shifts, where the LSB of `y`

is always the LSB of the output as in your example, is:

```
uint64_t interleave(uint32_t x, uint32_t y)
{
return
(y & 0xFFull)
| (x & 0xFFull) << 8
| (y & 0xFF00ull) << 8
| (x & 0xFF00ull) << 16
| (y & 0xFF0000ull) << 16
| (x & 0xFF0000ull) << 24
| (y & 0xFF000000ull) << 24
| (x & 0xFF000000ull) << 32;
}
```

The compilers I tried don't seem to do a good job of optimizing either version so if this is a performance critical situation then maybe the inline assembly suggestion from comments is the way to go.

answered on Stack Overflow Sep 6, 2018 by M.M

With bit-shifting and bitwise operations (endianness independent):

```
uint64_t interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
for(uint8_t i = 0; i < 4; i ++){
result |= ((x & (0xFFull << (8*i))) << (8*(i+1)));
result |= ((y & (0xFFull << (8*i))) << (8*i));
}
return result;
}
```

With pointers (endianness dependent):

```
uint64_t interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
uint8_t * x_ptr = (uint8_t *)&x;
uint8_t * y_ptr = (uint8_t *)&y;
uint8_t * r_ptr = (uint8_t *)&result;
for(uint8_t i = 0; i < 4; i++){
*(r_ptr++) = y_ptr[i];
*(r_ptr++) = x_ptr[i];
}
return result;
}
```

Note: this solution assumes little-endian byte order

use union punning. Easy for the compiler to optimize.

```
#include <stdio.h>
#include <stdint.h>
#include <string.h>
typedef union
{
uint64_t u64;
struct
{
union
{
uint32_t a32;
uint8_t a8[4]
};
union
{
uint32_t b32;
uint8_t b8[4]
};
};
uint8_t u8[8];
}data_64;
uint64_t interleave(uint32_t a, uint32_t b)
{
data_64 in , out;
in.a32 = a;
in.b32 = b;
for(size_t index = 0; index < sizeof(a); index ++)
{
out.u8[index * 2 + 1] = in.a8[index];
out.u8[index * 2 ] = in.b8[index];
}
return out.u64;
}
int main(void)
{
printf("%llx\n", interleave(0x12345678U, 0xdeadbeefU)) ;
}
```

answered on Stack Overflow Sep 6, 2018 by P__J__

User contributions licensed under CC BY-SA 3.0