# How does the `_mm_cmpgt_epi64` intrinsic work?

I'm using the `_mm_cmpgt_epi64` intrinsic to implement a 128-bit addition, and later a 256-bit one. Looking at the result of this intrinsic, something puzzles me.

I don't understand why the computed mask is the way it is.

```
const __m128i mask = _mm_cmpgt_epi64(bflip, sumflip);
```

And here's the output in my debugger:

```
(lldb) p/x bflip
(__m128i) $1 = (0x00000001, 0x80000000, 0x00000000, 0x80000000)
(lldb) p/x sumflip
(__m128i) $2 = (0x00000000, 0x80000000, 0xffffffff, 0x7fffffff)
(lldb) p/x mask
(__m128i) $3 = (0xffffffff, 0xffffffff, 0x00000000, 0x00000000)
```

The first 64-bit lane (`63:0`) makes sense to me. But why isn't the second lane (`127:64`) all ones too?

It seems to me that `0x8000000000000000` > `0x7fffffffffffffff`.

x86-64
sse
simd
intrinsics
sse4
asked on Stack Overflow Oct 14, 2018 by Stringer • edited Oct 14, 2018 by Peter Cordes

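It appears you're printing the vectors in 32-bit chunks rather than 64-bit lanes, which is weird and makes the output harder to read: each pair of chunks forms one 64-bit lane, low half first.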

But anyway, it's a signed two's complement integer compare, as documented in the manual: http://felixcloutier.com/x86/PCMPGTQ.html

`0x8000000000000000` is the most negative 64-bit integer, while `0x7fffffffffffffff` is the largest positive.
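To make the signed interpretation concrete, here is a minimal scalar sketch of what a single 64-bit lane of the compare does. The helper name `lane_cmpgt_signed` is made up for illustration; it only models the documented behaviour and is not part of any intrinsics header.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of ONE 64-bit lane of _mm_cmpgt_epi64 (PCMPGTQ).
   The two bit patterns are reinterpreted as signed two's complement values
   and compared; the result is an all-ones mask if a > b, otherwise zero. */
uint64_t lane_cmpgt_signed(uint64_t a_bits, uint64_t b_bits)
{
    int64_t a, b;
    memcpy(&a, &a_bits, sizeof a);  /* reinterpret the bits, no value conversion */
    memcpy(&b, &b_bits, sizeof b);
    return (a > b) ? UINT64_MAX : 0;
}
```

Feeding it the two lanes from the question gives the same mask the debugger shows: `lane_cmpgt_signed(0x8000000000000001, 0x8000000000000000)` is all ones, while `lane_cmpgt_signed(0x8000000000000000, 0x7fffffffffffffff)` is zero, because the first operand is the most negative value.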

If you want an unsigned compare, you need to range-shift both inputs by flipping their sign bit. Logically this is subtracting 2^63 to go from 0..2^64-1 to -2^63 .. 2^63-1. But we can do it with a more efficient XOR, because XOR is add-without-carry, and the carry/borrow-out goes off the end of the register.

```
// Flip the sign bit of every 64-bit lane so that a signed compare orders the values as unsigned.
const __m128i rangeshift = _mm_set1_epi64x(0x8000000000000000);
const __m128i mask = _mm_cmpgt_epi64(_mm_xor_si128(bflip, rangeshift), _mm_xor_si128(sumflip, rangeshift));
```
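The range-shift trick can be checked on one lane with plain scalar code. This is only a sketch under the same modelling assumption as the compare itself (one lane, signed comparison); `lane_cmpgt_unsigned_via_flip` is a hypothetical name, not a real intrinsic.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one lane of the XOR range-shift trick:
   flip the sign bit of both inputs, then do a SIGNED greater-than.
   The result matches an UNSIGNED a > b on the original inputs. */
uint64_t lane_cmpgt_unsigned_via_flip(uint64_t a, uint64_t b)
{
    const uint64_t sign = UINT64_C(0x8000000000000000);
    uint64_t a_flipped = a ^ sign;  /* maps 0..2^64-1 onto -2^63..2^63-1 */
    uint64_t b_flipped = b ^ sign;
    int64_t as, bs;
    memcpy(&as, &a_flipped, sizeof as);  /* reinterpret as signed */
    memcpy(&bs, &b_flipped, sizeof bs);
    return (as > bs) ? UINT64_MAX : 0;   /* all-ones mask, like the SIMD compare */
}
```

The flip has to be applied exactly once per input. If `bflip` and `sumflip` in the question already had their sign bits flipped (an assumption based only on their names), they can be passed to `_mm_cmpgt_epi64` directly, and XORing them again would undo the range shift.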
answered on Stack Overflow Oct 14, 2018 by Peter Cordes

User contributions licensed under CC BY-SA 3.0