I want to define an efficient integer floor function, i.e. a conversion from float or double that performs truncation towards minus infinity.
We can assume that the values are such that no integer overflow occurs. So far I have a few options:

casting to int; this requires special handling of the negative values, as the cast truncates toward zero:
I = int(F); if (I < 0 && I != F) I--;

casting the result of floor to int:
int(floor(F));

casting to int with a large shift to get positives (this can return wrong results for large values):
int(F + double(0x7fffffff)) - 0x7fffffff;
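For concreteness, the first option can be wrapped in a helper (floor_fixup is just a name I made up) and compared against int(floor(F)):

```cpp
#include <cmath>

// Option 1 as a function: the cast truncates toward zero, so step down by
// one for negative non-integral inputs to match floor().
int floor_fixup(double F) {
    int I = int(F);
    if (I < 0 && I != F) I--;
    return I;
}
```

For example, floor_fixup(-2.5) gives -3, matching int(std::floor(-2.5)), while the bare cast gives -2.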
Casting to int is notoriously slow. So are if tests. I have not timed the floor function, but have seen posts claiming it is also slow.
Can you think of better alternatives in terms of speed, accuracy or allowed range? It doesn't need to be portable. The targets are recent x86/x64 architectures.
Casting to int is notoriously slow.
Maybe you've been living under a rock since x86-64, or otherwise missed that this hasn't been true for a while on x86. :)
SSE/SSE2 have an instruction to convert with truncation (instead of the default rounding mode). The ISA supports this operation efficiently precisely because conversion with C semantics is not rare in actual codebases. x86-64 code uses SSE/SSE2 XMM registers for scalar FP math, not x87, because of this and other things that make it more efficient. Even modern 32-bit code uses XMM registers for scalar math.
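To illustrate (my example, not from the question): on these targets, a plain cast is already a single instruction.

```cpp
// On x86-64, compilers turn this plain cast into one cvttss2si
// instruction: conversion with truncation, no rounding-mode changes.
int trunc_to_int(float f) {
    return (int)f;  // C semantics: truncate toward zero
}
```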
When compiling for x87 (without SSE3 fisttp), compilers used to have to change the x87 rounding mode to truncation, FP store to memory, then change the rounding mode back again. (And then reload the integer from memory, typically from a local on the stack, if doing further stuff with it.) x87 was terrible for this.
Yes that was horribly slow, e.g. in 2006 when the link in @Kirjain's answer was written, if you still had a 32-bit CPU or were using an x86-64 CPU to run 32-bit code.
Converting with a rounding mode other than truncation or default (nearest) isn't directly supported, and until SSE4.1 roundps/roundpd your best bet was magic-number tricks like in the 2006 link from @Kirjain's answer.
Some nice tricks there, but only for double -> 32-bit integer. Unlikely to be worth expanding to double if you have float.
Or more usually, simply add a large-magnitude number to trigger rounding, then subtract it again to get back to the original range. This can work for float without expanding to double, but I'm not sure how easy it is to make floor work.
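For illustration, here's a minimal sketch of that add-then-subtract trick for float (my example: it produces round-to-nearest, not floor, and assumes the default rounding mode with |x| well below 2^22):

```cpp
// Add 1.5 * 2^23 so the sum lands in [2^23, 2^24), where one float ulp is
// exactly 1.0; the FP add therefore rounds x to an integer value, and
// subtracting the constant back recovers it. Round-to-nearest, not floor.
float round_via_magic(float x) {
    const float magic = 12582912.0f;  // 1.5 * 2^23
    return (x + magic) - magic;
}
```

Note this breaks under -ffast-math, where the compiler may fold (x + magic) - magic back to x.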
Anyway, the obvious solution here is _mm256_floor_ps() and _mm256_cvtps_epi32 (vroundps and vcvtps2dq). A non-AVX version of this can work with SSE4.1.
I'm not sure if we can do even better; if you had a huge array to process (and couldn't manage to interleave this work with other work), you could set the MXCSR rounding mode to "towards -Inf" (floor) and simply use vcvtps2dq (which uses the current rounding mode), then set it back. But it's probably better to cache-block your conversion or do it on the fly as you generate the data, presumably from other FP calculations that need the FP rounding mode set to the default Nearest.
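A scalar sketch of that idea through the standard <cfenv> interface (which maps to MXCSR on x86); in real code you'd flip the mode once around the whole loop, and the volatile here is only to stop the compiler from constant-folding lrintf under its default-rounding assumption:

```cpp
#include <cfenv>
#include <cmath>

// Switch rounding to "toward -Inf", convert with an instruction that uses
// the current rounding mode (lrintf), then restore the old mode.
long floor_via_rounding_mode(float x) {
    const int old_mode = std::fegetround();
    std::fesetround(FE_DOWNWARD);
    volatile float v = x;     // defeat compile-time folding of lrintf
    long r = std::lrintf(v);  // honors the current (MXCSR) rounding mode
    std::fesetround(old_mode);
    return r;
}
```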
roundps/pd/ss/sd is 2 uops on Intel CPUs, but only 1 uop (per 128-bit lane) on AMD Ryzen. cvtps2dq is also 1 uop. Packed double->int conversion also includes a shuffle, and scalar FP->int conversion (that copies to an integer register) usually also costs an extra uop for that.
So there's room for the possibility of magic-number tricks being a win in some cases; it may be worth investigating if _mm256_floor_ps() + cvt are part of a critical bottleneck (or more likely if you have double and want int32).
@Cássio Renan's int foo = floorf(f) will actually auto-vectorize if compiled with gcc -O3 -fno-trapping-math (or -ffast-math), with -march= something that has SSE4.1 or AVX. https://godbolt.org/z/ae_KPv
That's maybe useful if you're using this with other scalar code that's not manually vectorized. Especially if you're hoping the compiler will auto-vectorize the whole thing.
Have a look at magic numbers. The algorithm proposed on the web page should be far more efficient than simple casting. I've never used it myself, but this is the performance comparison they offer on the site (xs_ToInt and xs_CRoundToInt are the proposed functions):
Performing 10000000 times:
simple cast 2819 ms i.e. i = (long)f;
xs_ToInt 1242 ms i.e. i = xs_ToInt(f); //numerically same as above
bit-twiddle(full) 1093 ms i.e. i = BitConvertToInt(f); //rounding from Fluid
fistp 676 ms i.e. i = FISTToInt(f); //Herf, et al x86 Assembly rounding
bit-twiddle(limited) 623 ms i.e. i = FloatTo23Bits(f); //Herf, rounding only in the range (0...1]
xs_CRoundToInt 609 ms i.e. i = xs_CRoundToInt(f); //rounding with "magic" numbers
Further, xs_ToInt was apparently modified so that the performance improves:
Performing 10000000 times:
simple cast convert 3186 ms i.e. fi = (f*65536);
fistp convert 3031 ms i.e. fi = FISTToInt(f*65536);
xs_ToFix 622 ms i.e. fi = xs_Fix<16>::ToFix(f);
Brief explanation of how the 'magic numbers' method works:
"Basically, in order to add two floating point numbers, your processor "lines up" the decimal points of the numbers so that it can easily add the bits. It does this by "normalizing" the numbers such that the most significant bits are preserved, i.e. the smaller number "normalizes" to match the bigger one. So the principle of the "magic number" conversion that xs_CRoundToInt() uses is this: We add a big enough floating point number (a number that is so big that there are significant digits only UP TO the decimal point, and none after it) to the one you're converting such that: (a) the number gets normalized by the processor to its integer equivalent and (b) adding the two does not erase the integral significant bits in the number you were trying to convert (i.e. XX00 + 00YY = XXYY)."
The quote is taken from the same web page.
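That description can be sketched directly (my version, not the site's exact xs_CRoundToInt, but the same principle): add 2^52 + 2^51 to a double, and the rounded integer appears in the low 32 bits of the sum's bit pattern:

```cpp
#include <cstdint>
#include <cstring>

// Magic-number round-to-nearest double -> int32. Adding 2^52 + 2^51 aligns
// the mantissa so the integer value lands in the low 32 bits (in two's
// complement, thanks to the extra 2^51 term). Assumes the default
// round-to-nearest mode and |x| < 2^31.
int32_t magic_round_to_int(double x) {
    const double magic = 6755399441055744.0;  // 2^52 + 2^51
    const double t = x + magic;
    int64_t bits;
    std::memcpy(&bits, &t, sizeof bits);      // safe type-pun
    return static_cast<int32_t>(bits);        // keep the low 32 bits
}
```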
If you're doing this in batch, the compiler may autovectorize it, if you know what you're doing. For example, here is a small implementation that autovectorizes the conversion of floats to integers, on GCC:
#include <cmath>
// Compile with -O3 and -march=native to see autovectorization
__attribute__((optimize("-fno-trapping-math")))
void testFunction(float* input, int* output, int length) {
// Assume the input and output are aligned on a 32-byte boundary.
// Of course, you have to ensure this when calling testFunction, or else
// you will have problems.
input = static_cast<float*>(__builtin_assume_aligned(input, 32));
output = static_cast<int*>(__builtin_assume_aligned(output, 32));
// Also assume the length is a multiple of 32.
if (length & 31) __builtin_unreachable();
// Do the conversion
for (int i = 0; i < length; ++i) {
output[i] = floor(input[i]);
}
}
This is the generated assembly for x86-64 (with AVX512 instructions):
testFunction(float*, int*, int):
test edx, edx
jle .L5
lea ecx, [rdx-1]
xor eax, eax
.L3:
# you can see here that the conversion was vectorized
# to a vrndscaleps (that will round the float appropriately)
# and a vcvttps2dq (that will perform the conversion)
vrndscaleps ymm0, YMMWORD PTR [rdi+rax], 1
vcvttps2dq ymm0, ymm0
vmovdqa64 YMMWORD PTR [rsi+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L3
vzeroupper
.L5:
ret
If your target doesn't support AVX512, it will still autovectorize using SSE4.1 instructions, assuming you have those. This is the output with -O3 -msse4.1:
testFunction(float*, int*, int):
test edx, edx
jle .L1
shr edx, 2
xor eax, eax
sal rdx, 4
.L3:
roundps xmm0, XMMWORD PTR [rdi+rax], 1
cvttps2dq xmm0, xmm0
movaps XMMWORD PTR [rsi+rax], xmm0
add rax, 16
cmp rax, rdx
jne .L3
.L1:
ret
Why not just use this:
#include <cmath>
auto floor_(float const x) noexcept
{
int const t(x); // truncates toward zero
return t - (t > x); // t > x exactly when x is a negative non-integer, so subtract 1 then
}
Here’s a modification of Cássio Renan’s excellent answer. It replaces all compiler-specific extensions with standard C++ and is, in theory, portable to any conforming compiler. In addition, it checks that the arguments are properly aligned rather than assuming so. It optimizes to the same code.
#include <assert.h>
#include <cmath>
#include <stddef.h>
#include <stdint.h>
#define ALIGNMENT alignof(max_align_t)
using std::floor;
// Compiled with: -std=c++17 -Wall -Wextra -Wpedantic -Wconversion -fno-trapping-math -O -march=cannonlake -mprefer-vector-width=512
void testFunction(const float in[], int32_t out[], const ptrdiff_t length)
{
static_assert(sizeof(float) == sizeof(int32_t), "");
assert((uintptr_t)(void*)in % ALIGNMENT == 0);
assert((uintptr_t)(void*)out % ALIGNMENT == 0);
assert((size_t)length % (ALIGNMENT/sizeof(int32_t)) == 0);
alignas(ALIGNMENT) const float* const input = in;
alignas(ALIGNMENT) int32_t* const output = out;
// Do the conversion
for (ptrdiff_t i = 0; i < length; ++i) {
output[i] = static_cast<int32_t>(floor(input[i]));
}
}
This doesn’t optimize quite as nicely on GCC as the original, which used non-portable extensions. The C++ standard does support an alignas specifier, references to aligned arrays, and a std::align function that returns an aligned range within a buffer. None of these, however, make any compiler I tested generate aligned instead of unaligned vector loads and stores.
Although alignof(max_align_t) is only 16 on x86_64, and it is possible to define ALIGNMENT as the constant 64, this doesn’t help any compiler generate better code, so I went for portability. The closest thing to a portable way to force the compiler to assume a pointer is aligned would be to use the types from <immintrin.h>, which most compilers for x86 support, or to define a struct with an alignas specifier. By checking predefined macros, you could also expand a macro to __attribute__ ((aligned (ALIGNMENT))) on Linux compilers, or __declspec (align (ALIGNMENT)) on Windows compilers, and something safe on a compiler we don’t know about, but GCC needs the attribute on a type to actually generate aligned loads and stores.
Additionally, the original example called a built-in to tell GCC that it was impossible for length not to be a multiple of 32. If you assert() this or call a standard function such as abort(), neither GCC, Clang nor ICC will make the same deduction. Therefore, most of the code they generate will handle the case where length is not a nice round multiple of the vector width.
A likely reason for this is that neither optimization gets you that much speed: unaligned memory instructions with aligned addresses are fast on Intel CPUs, and the code to handle the case where length is not a nice round number is a few bytes long and runs in constant time.
As a footnote, GCC is able to optimize inline functions from <cmath> better than the macros implemented in <math.h>.
GCC 9.1 needs a particular set of options to generate AVX512 code. By default, even with -march=cannonlake, it will prefer 256-bit vectors. It needs -mprefer-vector-width=512 to generate 512-bit code. (Thanks to Peter Cordes for pointing this out.) It follows up the vectorized loop with unrolled code to convert any leftover elements of the array.
Here’s the vectorized main loop, minus some constant-time initialization, error-checking and clean-up code that will only run once:
.L7:
vrndscaleps zmm0, ZMMWORD PTR [rdi+rax], 1
vcvttps2dq zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rsi+rax], zmm0
add rax, 64
cmp rax, rcx
jne .L7
The eagle-eyed will notice two differences from the code generated by Cássio Renan’s program: it uses %zmm instead of %ymm registers, and it stores the results with an unaligned vmovdqu32 rather than an aligned vmovdqa64.
Clang 8.0.0 with the same flags makes different choices about unrolling loops. Each iteration operates on eight 512-bit vectors (that is, 128 single-precision floats), but the code to pick up leftovers is not unrolled. If there are at least 64 floats left over after that, it uses another four AVX512 instructions for those, and then cleans up any extras with an unvectorized loop.
If you compile the original program in Clang++, it will accept it without complaint, but won’t make the same optimizations: it will still not assume that the length is a multiple of the vector width, nor that the pointers are aligned. It prefers AVX512 code to AVX256, even without -mprefer-vector-width=512.
test rdx, rdx
jle .LBB0_14
cmp rdx, 63
ja .LBB0_6
xor eax, eax
jmp .LBB0_13
.LBB0_6:
mov rax, rdx
and rax, -64
lea r9, [rax - 64]
mov r10, r9
shr r10, 6
add r10, 1
mov r8d, r10d
and r8d, 1
test r9, r9
je .LBB0_7
mov ecx, 1
sub rcx, r10
lea r9, [r8 + rcx]
add r9, -1
xor ecx, ecx
.LBB0_9: # =>This Inner Loop Header: Depth=1
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 64], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 128], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 192], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vmovups zmmword ptr [rsi + 4*rcx], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 64], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 128], zmm2
vcvttps2dq zmm0, zmm3
vmovups zmmword ptr [rsi + 4*rcx + 192], zmm0
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx + 256], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 320], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 384], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 448], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vcvttps2dq zmm3, zmm3
vmovups zmmword ptr [rsi + 4*rcx + 256], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 320], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 384], zmm2
vmovups zmmword ptr [rsi + 4*rcx + 448], zmm3
sub rcx, -128
add r9, 2
jne .LBB0_9
test r8, r8
je .LBB0_12
.LBB0_11:
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 64], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 128], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 192], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vcvttps2dq zmm3, zmm3
vmovups zmmword ptr [rsi + 4*rcx], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 64], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 128], zmm2
vmovups zmmword ptr [rsi + 4*rcx + 192], zmm3
.LBB0_12:
cmp rax, rdx
je .LBB0_14
.LBB0_13: # =>This Inner Loop Header: Depth=1
vmovss xmm0, dword ptr [rdi + 4*rax] # xmm0 = mem[0],zero,zero,zero
vroundss xmm0, xmm0, xmm0, 9
vcvttss2si ecx, xmm0
mov dword ptr [rsi + 4*rax], ecx
add rax, 1
cmp rdx, rax
jne .LBB0_13
.LBB0_14:
pop rax
vzeroupper
ret
.LBB0_7:
xor ecx, ecx
test r8, r8
jne .LBB0_11
jmp .LBB0_12
ICC 19 also generates AVX512 instructions, but very different code from clang’s. It does more set-up with magic constants, but does not unroll any loops, operating instead on 512-bit vectors.
This code also works on other compilers and architectures. (Although MSVC only supports the ISA up to AVX2 and cannot auto-vectorize the loop.) On ARM with -march=armv8-a+simd, for example, it generates a vectorized loop with frintm v0.4s, v0.4s and fcvtzs v0.4s, v0.4s.
User contributions licensed under CC BY-SA 3.0