I'm using the `_mm_cmpgt_epi64` intrinsic to implement a 128-bit addition, and later a 256-bit one.
Looking at the result of this intrinsic, something puzzles me.

I don't understand why the computed mask is the way it is.

```
const __m128i mask = _mm_cmpgt_epi64(bflip, sumflip);
```

And here's the output in my debugger:

```
(lldb) p/x bflip
(__m128i) $1 = (0x00000001, 0x80000000, 0x00000000, 0x80000000)
(lldb) p/x sumflip
(__m128i) $2 = (0x00000000, 0x80000000, 0xffffffff, 0x7fffffff)
(lldb) p/x mask
(__m128i) $3 = (0xffffffff, 0xffffffff, 0x00000000, 0x00000000)
```

For the first 64-bit lane (`63:0`) I'm OK. But why isn't the second lane (`127:64`) full of ones too?

It seems to me that `0x8000000000000000` > `0x7fffffffffffffff`.

It appears you're printing it in 32-bit chunks, not 64-bit, so that's weird.

But anyway, it's a **signed** two's complement integer compare, as documented in the manual: http://felixcloutier.com/x86/PCMPGTQ.html

`0x8000000000000000` is the most negative 64-bit integer, while `0x7fffffffffffffff` is the largest positive.
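To check this against the debugger output, here is a minimal scalar sketch of what one 64-bit lane of the signed compare does. The helper names `lane_from_chunks` and `cmpgt_epi64_lane` are made up for illustration, and the sketch assumes lldb prints the low 32-bit chunk of each lane first.

```c
#include <stdint.h>

/* Hypothetical helper: join two 32-bit debugger chunks into one 64-bit lane.
   Assumes the low chunk is the one printed first. */
static uint64_t lane_from_chunks(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical scalar model of one lane of _mm_cmpgt_epi64:
   both inputs are reinterpreted as signed (two's complement assumed),
   and the result is all-ones if a > b, else zero. */
static uint64_t cmpgt_epi64_lane(uint64_t a, uint64_t b) {
    return ((int64_t)a > (int64_t)b) ? UINT64_MAX : 0;
}
```

With the values from the question, the high lane is `cmpgt_epi64_lane(lane_from_chunks(0x00000000, 0x80000000), lane_from_chunks(0xffffffff, 0x7fffffff))`, which yields zero because the first operand is negative when read as signed. The low lane compares `0x8000000000000001` with `0x8000000000000000`; both are negative and the first is larger, so that lane is all-ones.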

If you want an unsigned compare, you need to range-shift both inputs by flipping their sign bit. Logically this is subtracting 2^63 to go from 0..2^64-1 to -2^63 .. 2^63-1. But we can do it with a more efficient XOR, because XOR is add-without-carry, and the carry/borrow-out goes off the end of the register.

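```
// Sign bit of each 64-bit lane; XORing with it range-shifts unsigned to signed.
const __m128i rangeshift = _mm_set1_epi64x(0x8000000000000000);
// Signed compare on the shifted inputs gives the unsigned ordering of the originals.
const __m128i mask = _mm_cmpgt_epi64(_mm_xor_si128(bflip, rangeshift), _mm_xor_si128(sumflip, rangeshift));
```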
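To see why the sign-bit flip works, the same trick can be sketched on a single 64-bit lane in plain C. This is a scalar model and not the intrinsic itself; the function name `cmpgt_epu64_lane` is hypothetical, and the cast to `int64_t` assumes two's complement.

```c
#include <stdint.h>

/* Hypothetical scalar model of an unsigned 64-bit "greater than" lane
   built from a signed compare: flip the sign bit of both inputs,
   then compare as signed. Returns all-ones if a > b (unsigned), else zero. */
static uint64_t cmpgt_epu64_lane(uint64_t a, uint64_t b) {
    const uint64_t rangeshift = 0x8000000000000000ull; /* sign bit only */
    int64_t sa = (int64_t)(a ^ rangeshift); /* maps 0..2^64-1 to -2^63..2^63-1 */
    int64_t sb = (int64_t)(b ^ rangeshift);
    return (sa > sb) ? UINT64_MAX : 0;
}
```

For example, `cmpgt_epu64_lane(0x8000000000000000, 0x7fffffffffffffff)` is all-ones, matching the unsigned ordering the question expected, because the XOR turns the operands into `0` and `-1` before the signed compare.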

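Or, with AVX512F, use the unsigned compare-into-mask intrinsics `__mmask8 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m512i a, __m512i b)`. These return a bitmask rather than a vector, and the 128-bit and 256-bit versions (e.g. `_mm_cmpgt_epu64_mask`) also need AVX512VL.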

answered on Stack Overflow Oct 14, 2018 by Peter Cordes

User contributions licensed under CC BY-SA 3.0