I'm using the
_mm_cmpgt_epi64 intrinsic to implement a 128-bit addition, and later a 256-bit one.
Looking at the result of this intrinsic, something puzzles me: I don't understand why the computed mask is the way it is.
const __m128i mask = _mm_cmpgt_epi64(bflip, sumflip);
And here's the output in my debugger:
(lldb) p/x bflip
(__m128i) $1 = (0x00000001, 0x80000000, 0x00000000, 0x80000000)
(lldb) p/x sumflip
(__m128i) $2 = (0x00000000, 0x80000000, 0xffffffff, 0x7fffffff)
(lldb) p/x mask
(__m128i) $3 = (0xffffffff, 0xffffffff, 0x00000000, 0x00000000)
For the first 64-bit lane (bits 63:0) I'm OK with the result. But why isn't the second lane (bits 127:64) all ones too?
It appears you're printing the vector in 32-bit chunks, not 64-bit ones, which makes the lanes confusing to read.
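If it helps, here's a minimal, self-contained sketch of printing the two 64-bit lanes directly; the print_epi64 helper is mine, and the sample value is reconstructed from your debugger output:

#include <immintrin.h>
#include <inttypes.h>
#include <stdio.h>

/* Print an __m128i as its two 64-bit lanes (lane 0 = bits 63:0). */
static void print_epi64(const char *name, __m128i v) {
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, v);
    printf("%s: lane1=0x%016" PRIx64 " lane0=0x%016" PRIx64 "\n",
           name, lanes[1], lanes[0]);
}

int main(void) {
    /* _mm_set_epi64x takes the high lane first. */
    __m128i bflip = _mm_set_epi64x((int64_t)0x8000000000000000ULL,
                                   (int64_t)0x8000000000000001ULL);
    print_epi64("bflip", bflip);  /* lane1=0x8000000000000000 lane0=0x8000000000000001 */
    return 0;
}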
Read as 64-bit lanes, the high lane (bits 127:64) of bflip is 0x8000000000000000, the most negative 64-bit signed integer, while the high lane of sumflip is 0x7fffffffffffffff, the largest positive one. _mm_cmpgt_epi64 does a signed comparison, so bflip > sumflip is false for that lane, and an all-zero mask is the correct result.
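To see the same comparison in scalar code (a sketch I'm adding for illustration, not part of the original question):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t b   = (int64_t)0x8000000000000000ULL;  /* INT64_MIN */
    int64_t sum = 0x7fffffffffffffff;              /* INT64_MAX */

    /* Signed view: INT64_MIN > INT64_MAX is false, matching the
       all-zero lane that _mm_cmpgt_epi64 produced. */
    printf("signed:   b > sum -> %d\n", b > sum);

    /* Unsigned view: 0x8000... > 0x7fff... is true, which is the
       result the question was expecting. */
    printf("unsigned: b > sum -> %d\n", (uint64_t)b > (uint64_t)sum);
    return 0;
}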
If you want an unsigned compare, you need to range-shift both inputs by flipping their sign bit. Logically this is subtracting 2^63 to go from 0 .. 2^64-1 to -2^63 .. 2^63-1, but we can do it with a more efficient XOR, because XOR is add-without-carry and the carry/borrow-out goes off the end of the register.
const __m128i rangeshift = _mm_set1_epi64x((int64_t)0x8000000000000000ULL);
const __m128i mask = _mm_cmpgt_epi64(_mm_xor_si128(bflip,   rangeshift),
                                     _mm_xor_si128(sumflip, rangeshift));
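For completeness, here's a minimal sketch of how this unsigned compare typically plugs into carry detection for a wide addition; the cmpgt_epu64 wrapper and the test values are my own illustration of the usual idiom (a 64-bit lane carried out iff the wrapped sum is unsigned-less-than an operand, i.e. b > sum as unsigned):

#include <immintrin.h>  /* _mm_cmpgt_epi64 needs SSE4.2 */
#include <stdint.h>
#include <stdio.h>

/* Per-lane unsigned a > b via the range-shifted signed compare. */
static __m128i cmpgt_epu64(__m128i a, __m128i b) {
    const __m128i rangeshift = _mm_set1_epi64x((int64_t)0x8000000000000000ULL);
    return _mm_cmpgt_epi64(_mm_xor_si128(a, rangeshift),
                           _mm_xor_si128(b, rangeshift));
}

int main(void) {
    /* _mm_set_epi64x takes the high lane first; lane 0 holds UINT64_MAX
       so its addition wraps and should report a carry. */
    __m128i a = _mm_set_epi64x(1, -1);
    __m128i b = _mm_set_epi64x(2, 1);
    __m128i sum = _mm_add_epi64(a, b);
    /* A lane carried out iff b > sum when compared as unsigned. */
    __m128i carry = cmpgt_epu64(b, sum);

    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, carry);
    printf("carry: lane1=0x%016llx lane0=0x%016llx\n",
           (unsigned long long)lanes[1], (unsigned long long)lanes[0]);
    /* prints: carry: lane1=0x0000000000000000 lane0=0xffffffffffffffff */
    return 0;
}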