I'm currently writing a lecture on ARM optimization, specifically on vector machines such as NEON as the final target.
And since vector machines don't fare well with if-else slaloms, I'm trying to demonstrate how to get rid of them by bit-hacking.
I picked the "saturating absolute" function as an example for this. It's practically an ABS
routine with the added functionality of capping the result at 0x7fffffff.
The biggest possible negative 32bit number is 0x80000000, and it's a very dangerous thing because val = -val;
returns the same 0x80000000 as the initial value, caused by the asymmetry in the two's complement system especially for DSP operations, and thus, it has to be filtered out, mostly by "saturating".
int32_t satAbs1(int32_t val)
{
if (val < 0) val = -val;
if (val < 0) val = 0x7fffffff;
return val;
}
Below is what I would write in assembly:
cmp r0, #0
rsblts r0, r0, #0
mvnlt r0, #0x80000000
bx lr
And below is what I actually get for the C code above:
satAbs1
0x00000000: CMP r0,#0
0x00000004: RSBLT r0,r0,#0
0x00000008: BX lr
WTH? The compiler simply discarded the saturating part altogether!
The compiler seems to be ruling out val
being negative after the first if
statement which isn't true if it was 0x80000000
Or maybe the function should return an unsigned value?
uint32_t satAbs2(int32_t val)
{
uint32_t result;
if (val < 0) result = (uint32_t) -val; else result = (uint32_t) val;
if (result == 0x80000000) result = 0x7fffffff;
return result;
}
satAbs2
0x0000000C: CMP r0,#0
0x00000010: RSBLT r0,r0,#0
0x00000014: BX lr
Unfortunately, it generates the exact same machine codes as the signed version: no saturation.
Again, the compiler seems to rule out the case of val
being 0x80000000
Ok, let's widen the range of the second if statement:
uint32_t satAbs3(int32_t val)
{
uint32_t result;
if (val < 0) result = (uint32_t) -val; else result = (uint32_t) val;
if (result >= 0x80000000) result = 0x7fffffff;
return result;
}
satAbs3
0x00000018: CMP r0,#0
0x0000001C: RSBLT r0,r0,#0
0x00000020: CMP r0,#0
0x00000024: MVNLT r0,#0x80000000
0x00000028: BX lr
Finally, the compiler seems to be doing it's job, albeit sup-optimal (an unnecessary CMP
compared to the assembly version)
I can live with the compilers being sub-optimal, but what bothers me is that they are ruling out something that they shouldn't: 0x80000000
I'd even file a bug report to GCC
devs on this, but I found out that Clang
also rules out the case of the integer being 0x80000000, and thus I suppose I'm missing something regarding to the C standard.
Can anyone tell me where I'm mistaken?
Btw, below is what the if-less bit-hacking version looks like:
int32_t satAbs_bh(int32_t val)
{
int32_t temp = val ^ (val>>31);
val = temp + (val>>31);
val ^= val>>31;
return val;
}
satAbs_bh
0x0000002C: EOR r3,r0,r0,ASR #31
0x00000030: ADD r0,r3,r0,ASR #31
0x00000034: EOR r0,r0,r0,ASR #31
0x00000038: BX lr
Edit: I agree on this question of mine being a duplicate to some degree.
However, it is way more comprehensive including some assembly level stuff and bitmask technics that might be helpful compared to the referred one.
And below comes a workaround on this problem without mangling the compiler option; rule out the possibility of integer overflow preemptively:
int32_t satAbs4(int32_t val)
{
if (val == 0x80000000) return 0x7fffffff;
if (val < 0) val = -val;
return val;
}
satAbs4
0x0000002C: CMP r0,#0x80000000
0x00000030: BEQ {pc}+0x10 ; 0x40
0x00000034: CMP r0,#0
0x00000038: RSBLT r0,r0,#0
0x0000003C: BX lr
0x00000040: MVN r0,#0x80000000
0x00000044: BX lr
Again, the linaro GCC 7.4.1
I'm using demonstrates its shortcomings: I don't understand the BEQ
in line 2. moveq r0, #0x80000001
as suggested in the source code could have saved two instructions at the end.
Signed integer overflow or underflow is undefined behavior in C, meaning that you are expected to handle these edge cases yourself. In other words, as soon as the compiler has established that a certain signed integer value is positive, it doesn't care if there is a possibility that it could turn negative through UB.
For example, this code:
int test(int input)
{
if (input > 0)
input += 100;
if (input > 0)
input += 100;
if (input > 0)
input += 100;
return input;
}
can legally be optimized to this:
int test(int input)
{
if (input > 0)
input += 300;
return input;
}
even though the author of the initial code might have expected that input
could overflow between each successive statements.
That's why an optimizing compiler sees your code as something like this:
int32_t satAbs1(int32_t val)
{
if (val < 0) val = -val;
// val must be positive here,
// unless you are relying on UB
// the following condition is
// therefore always false:
// if (val < 0) val = 0x7fffffff;
return val;
}
So, the only way to avoid UB is to avoid negating the signed integer if there is a chance that it might invoke UB, i.e.:
int32_t satAbs3_simple(int32_t val)
{
if (val >= 0)
return val;
// we know that val is negative here,
// but unfortunately gcc knows it as well,
// so we'll handle the edge case explicitly
if (val == INT32_MIN)
return INT32_MAX;
return -val;
}
gcc with -O2 produces code with a branch (early conditional return at bxge
):
satAbs3_basic:
cmp r0, #0
bxge lr // return r0 if ge #0
cmp r0, #0x80000000
rsbne r0, r0, #0
moveq r0, #0x7FFFFFFF
bx lr
As @rici mentioned in the comments, if exact-width signed int types from stdint.h
(intN_t
) are available on your compiler, this means they have to be represented with N bits, no padding, using 2's complement.
This means that you can rewrite the code slightly to use bit masks, which might provide a slightly shorter assembly output (at least with gcc 5 or newer), still without branching:
int32_t satAbs3_c(int32_t val)
{
uint32_t result = (uint32_t)val;
if (result & 0x80000000) result = -result; // <-- avoid UB here by negating uint32_t
if (result == 0x80000000) result = 0x7FFFFFFF;
return (int32_t)result;
}
Note that an optimizing compiler should theoretically be able to produce this same output for both cases, but anyway, recent gcc versions (with -O1) for the last snippet give:
satAbs3_c:
cmp r0, #0
rsblt r0, r0, #0
cmp r0, #0x80000000
moveq r0, #0x7FFFFFFF
bx lr
I actually believe it cannot get shorter than this (apart from the xor bit-hacking), because your initial assembly seems to lack a cmp r0, #0
instruction after rsblts
(because rsblts
changes r0
, and cmp
is the part where actual comparison takes place).
User contributions licensed under CC BY-SA 3.0