Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

Question

Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:

constexpr unsigned X = 0;

unsigned f1(unsigned a, unsigned b) {
    b += a;
    unsigned c = b < a;
    return c + b + X;
}

Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks I got luck and the (g++ -O3, version 9.1) generated code is this:

f1(unsigned int, unsigned int):
 add %edi,%esi
 mov %esi,%eax
 adc $0x0,%eax
 retq

For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:

f1(unsigned int, unsigned int):
 xor %eax,%eax
 add %edi,%esi
 setb %al
 lea -0x1(%rsi,%rax,1),%eax
 retq

This seems less efficient than the initial code as suggested by indirect measurements (not very scientific though) Am I right? If so, is this a "missing optimization opportunity" kind of bug that is worth reporting?

For what is worth, clang -O3, version 8.8.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1 never does.

I've tried using the intrinsic _addcarry_u32 but it didn't help.

unsigned f2(unsigned a, unsigned b) {
    b += a;
    unsigned char c = b < a;
    _addcarry_u32(c, b, X, &b);
    return b;
}

I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)

I might, actually, be using it correctly. For X == 0 I'm happy:

f2(unsigned int, unsigned int):
 add %esi,%edi
 mov %edi,%eax
 adc $0x0,%eax
 retq

For X == -1 I'm unhappy :-(

f2(unsigned int, unsigned int):
 add %esi,%edi
 mov $0xffffffff,%eax
 setb %dl
 add $0xff,%dl
 adc %edi,%eax
 retq

I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)

c++

gcc

assembly

x86

bigint

asked on Stack Overflow May 12, 2019 by

Cassio Neri • edited Jul 12, 2019 by

Cassio Neri

1 Answer

mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs.¹

This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.

I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".

The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)

That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.

e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.

So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.

What's the point of using it since it's up to me to provide the carry flag?

You're using _addcarry_u32 correctly.

The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.

If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.

e.g. to add two 128-bit integers in 32-bit chunks, you can do this

// bad on x86-64 because it doesn't optimize the same as 2x _addcary_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
    unsigned char carry;
    carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
    carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
    carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
    carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}

(On Godbolt with GCC/clang/ICC)

That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.

GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.

Footnote 1: or for uop count: equal on Intel Haswell and earlier where adc is 2 uops, except with a zero immediate where Sandybridge-family's decoders special case that as 1 uop.

But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.

On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.

So equal total uop count but worse latency means that adc would still be a better choice.

https://agner.org/optimize/

answered on Stack Overflow May 12, 2019 by

Peter Cordes

User contributions licensed under CC BY-SA 3.0