#ifdef __GNUC__, use
__builtin_ctz(unsigned). (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs where there's no native instruction, it will call a GCC helper function.)
For 64-bit integers, use
__builtin_ctzll(unsigned long long). It's unfortunate that GNU C bitscan builtins don't take fixed-width types (especially trailing zeros), but
unsigned is always 32-bit on GNU C for x86 (although not for AVR or MSP430).
unsigned long long is always
uint64_t on all GNU C targets I'm aware of.
On x86, it will compile to
tzcnt depending on tuning + target options.
tzcnt is a single uop with 3 cycle latency on modern Intel, and only 2 uops with 2 cycle latency on AMD (perhaps a bit-reverse to feed an lzcnt uop?) https://agner.org/optimize/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++.
The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks if it might run as
In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like
immintrin.h. (Intel intrinsics guide). Or you might need to include
intrin.h (MSVC) or
x86intrin.h for non-SIMD intrinsics.
TZCNT decode as BSF on CPUs without BMI1 because its machine-code encoding is
rep bsf. They give identical results for non-zero inputs, so compilers can and do always just use
tzcnt because that's much faster on AMD. (They're the same speed on Intel so no downside. And on Skylake and later, tzcnt has no false output dependency. BSF does because it leaves its output unmodified for input=0).
(The situation is less convenient for
lzcnt: bsr returns the bit-index, lzcnt returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting BMI1 / TBM so the compiler can use
Note that with exactly 1 bit set, scanning from either direction will find the same bit. So
31 - lzcnt = bsr = bsf = tzcnt. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.
ffs()which returns a 1-based index and has to do extra work to account for the possibility of the input being 0.
Compilers do recognize
ffs() and inline it like a builtin (like they do for memcpy or sqrt), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.
User contributions licensed under CC BY-SA 3.0