Find position of first (lowest) set bit in 32-bit number

0

I need to get a 1-bit number in a 32-bit number, in which there is only one 1-bit (always). The fastest way in C ++ or asm.

For example

input:    0x00000001, 0x10000000
output:            0,         28
c++
assembly
x86
bit-manipulation
intrinsics
asked on Stack Overflow Oct 17, 2019 by Xx GoLd XxX • edited Oct 17, 2019 by Peter Cordes

1 Answer

4

#ifdef __GNUC__, use __builtin_ctz(unsigned). (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs where there's no native instruction, it will call a GCC helper function.)

For 64-bit integers, use __builtin_ctzll(unsigned long long). It's unfortunate that GNU C bitscan builtins don't take fixed-width types (especially trailing zeros), but unsigned is always 32-bit on GNU C for x86 (although not for AVR or MSP430). unsigned long long is always uint64_t on all GNU C targets I'm aware of.


On x86, it will compile to bsf or tzcnt depending on tuning + target options. tzcnt is a single uop with 3 cycle latency on modern Intel, and only 2 uops with 2 cycle latency on AMD (perhaps a bit-reverse to feed an lzcnt uop?) https://agner.org/optimize/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++.

The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks if it might run as bsf.


In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like _mm_tzcnt_32 from immintrin.h. (Intel intrinsics guide). Or you might need to include intrin.h (MSVC) or x86intrin.h for non-SIMD intrinsics.


TZCNT decode as BSF on CPUs without BMI1 because its machine-code encoding is rep bsf. They give identical results for non-zero inputs, so compilers can and do always just use tzcnt because that's much faster on AMD. (They're the same speed on Intel so no downside. And on Skylake and later, tzcnt has no false output dependency. BSF does because it leaves its output unmodified for input=0).

(The situation is less convenient for bsr vs. lzcnt: bsr returns the bit-index, lzcnt returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting BMI1 / TBM so the compiler can use lzcnt)

Note that with exactly 1 bit set, scanning from either direction will find the same bit. So 31 - lzcnt = bsr = bsf = tzcnt. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.


Related:

  • https://en.wikipedia.org/wiki/Find_first_set has more about bitscan functions across ISAs. Including POSIX ffs() which returns a 1-based index and has to do extra work to account for the possibility of the input being 0.

Compilers do recognize ffs() and inline it like a builtin (like they do for memcpy or sqrt), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.

answered on Stack Overflow Oct 17, 2019 by Peter Cordes • edited Oct 17, 2019 by Peter Cordes

User contributions licensed under CC BY-SA 3.0