I'm using the
_mm_cmpgt_epi64 intrinsic to implement a 128-bit addition, and later a 256-bit one.
Looking at the result of this intrinsic, something puzzles me: I don't understand why the computed mask is the way it is.
const __m128i mask = _mm_cmpgt_epi64(bflip, sumflip);
And here's the output in my debugger:
(lldb) p/x bflip
(__m128i) $1 = (0x00000001, 0x80000000, 0x00000000, 0x80000000)
(lldb) p/x sumflip
(__m128i) $2 = (0x00000000, 0x80000000, 0xffffffff, 0x7fffffff)
(lldb) p/x mask
(__m128i) $3 = (0xffffffff, 0xffffffff, 0x00000000, 0x00000000)
For the first 64-bit lane (bits 63:0) I'm OK with the result. But why isn't the second lane (bits 127:64) all ones too?
It appears you're printing the vector in 32-bit chunks, not 64-bit ones, which makes the lanes confusing to read.
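If it helps, here's a minimal, self-contained sketch of printing the two 64-bit lanes directly; the print_epi64 helper is mine, and the sample value is reconstructed from your debugger output:

#include <immintrin.h>
#include <inttypes.h>
#include <stdio.h>

/* Print an __m128i as its two 64-bit lanes (lane 0 = bits 63:0). */
static void print_epi64(const char *name, __m128i v) {
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, v);
    printf("%s: lane1=0x%016" PRIx64 " lane0=0x%016" PRIx64 "\n",
           name, lanes[1], lanes[0]);
}

int main(void) {
    /* _mm_set_epi64x takes the high lane first. */
    __m128i bflip = _mm_set_epi64x((int64_t)0x8000000000000000ULL,
                                   (int64_t)0x8000000000000001ULL);
    print_epi64("bflip", bflip);  /* lane1=0x8000000000000000 lane0=0x8000000000000001 */
    return 0;
}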
Read as 64-bit lanes, the high lane (bits 127:64) of bflip is 0x8000000000000000, the most negative 64-bit signed integer, while the high lane of sumflip is 0x7fffffffffffffff, the largest positive one. _mm_cmpgt_epi64 does a signed comparison, so bflip > sumflip is false for that lane, and an all-zero mask is the correct result.
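To see the same comparison in scalar code (a sketch I'm adding for illustration, not part of the original question):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t b   = (int64_t)0x8000000000000000ULL;  /* INT64_MIN */
    int64_t sum = 0x7fffffffffffffff;              /* INT64_MAX */

    /* Signed view: INT64_MIN > INT64_MAX is false, matching the
       all-zero lane that _mm_cmpgt_epi64 produced. */
    printf("signed:   b > sum -> %d\n", b > sum);

    /* Unsigned view: 0x8000... > 0x7fff... is true, which is the
       result the question was expecting. */
    printf("unsigned: b > sum -> %d\n", (uint64_t)b > (uint64_t)sum);
    return 0;
}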
If you want an unsigned compare, you need to range-shift both inputs by flipping their sign bit. Logically this is subtracting 2^63 to go from 0 .. 2^64-1 to -2^63 .. 2^63-1, but we can do it with a more efficient XOR, because XOR is add-without-carry and the carry/borrow-out goes off the end of the register.
const __m128i rangeshift = _mm_set1_epi64x((int64_t)0x8000000000000000ULL);
const __m128i mask = _mm_cmpgt_epi64(_mm_xor_si128(bflip,   rangeshift),
                                     _mm_xor_si128(sumflip, rangeshift));
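For completeness, here's a minimal sketch of how this unsigned compare typically plugs into carry detection for a wide addition; the cmpgt_epu64 wrapper and the test values are my own illustration of the usual idiom (a 64-bit lane carried out iff the wrapped sum is unsigned-less-than an operand, i.e. b > sum as unsigned):

#include <immintrin.h>  /* _mm_cmpgt_epi64 needs SSE4.2 */
#include <stdint.h>
#include <stdio.h>

/* Per-lane unsigned a > b via the range-shifted signed compare. */
static __m128i cmpgt_epu64(__m128i a, __m128i b) {
    const __m128i rangeshift = _mm_set1_epi64x((int64_t)0x8000000000000000ULL);
    return _mm_cmpgt_epi64(_mm_xor_si128(a, rangeshift),
                           _mm_xor_si128(b, rangeshift));
}

int main(void) {
    /* _mm_set_epi64x takes the high lane first; lane 0 holds UINT64_MAX
       so its addition wraps and should report a carry. */
    __m128i a = _mm_set_epi64x(1, -1);
    __m128i b = _mm_set_epi64x(2, 1);
    __m128i sum = _mm_add_epi64(a, b);
    /* A lane carried out iff b > sum when compared as unsigned. */
    __m128i carry = cmpgt_epu64(b, sum);

    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, carry);
    printf("carry: lane1=0x%016llx lane0=0x%016llx\n",
           (unsigned long long)lanes[1], (unsigned long long)lanes[0]);
    /* prints: carry: lane1=0x0000000000000000 lane0=0xffffffffffffffff */
    return 0;
}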