How to use bits in a byte to set dwords in ymm register without AVX2? (Inverse of vmovmskps)

What I'm trying to achieve is, based on each bit in a byte, to set the corresponding dword in a ymm register (or memory location) to all-ones or all-zeros.

e.g.

al = 0110 0001

ymm0 = 0x00000000 FFFFFFFF FFFFFFFF 00000000 00000000 00000000 00000000 FFFFFFFF

i.e. an inverse of vmovmskps eax, ymm0 / _mm256_movemask_ps, turning a bitmap into a vector mask.

I'm thinking there's a handful of SSE/AVX instructions that can do this relatively simply, but I haven't been able to work it out. Preferably Sandy Bridge compatible, so no AVX2.

asked on Stack Overflow Feb 15, 2018 by poby • edited Feb 17, 2018 by Peter Cordes

2 Answers

If AVX2 is available, see "is there an inverse instruction to the movemask instruction in intel avx2?" instead for more efficient versions using integer SIMD. You could use that idea and split your bitmap into two 4-bit chunks for use with a LUT. That might perform fairly well: vinsertf128 has 1 per clock throughput on Sandybridge, and one per 0.5c on Haswell/Skylake.
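
A minimal sketch of that LUT idea, assuming a 16-entry table of 4-dword masks (the table layout, ROW helper, and function name here are hypothetical, untested):

#include <immintrin.h>
#include <stdint.h>

// 16-entry LUT: entry n holds four dword masks for the low 4 bits of n.
// ROW is a hypothetical helper: -(bit) is 0xFFFFFFFF when the bit is set.
#define ROW(n) { (uint32_t)-((n)&1),      (uint32_t)-(((n)>>1)&1), \
                 (uint32_t)-(((n)>>2)&1), (uint32_t)-(((n)>>3)&1) }
static const uint32_t lut[16][4] __attribute__((aligned(16))) = {
    ROW(0),  ROW(1),  ROW(2),  ROW(3),  ROW(4),  ROW(5),  ROW(6),  ROW(7),
    ROW(8),  ROW(9),  ROW(10), ROW(11), ROW(12), ROW(13), ROW(14), ROW(15)
};

__m256 inverse_movemask_lut(unsigned bitmap)    // hypothetical name
{
    __m128 lo = _mm_load_ps((const float *)lut[bitmap & 0xf]);
    __m128 hi = _mm_load_ps((const float *)lut[(bitmap >> 4) & 0xf]);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1); // vinsertf128
}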

A SIMD-integer solution with AVX1 could just do the same work twice for high/low vector halves (2x broadcast the bitmap, 2x mask it, 2x vpcmpeqd xmm), then vinsertf128, but that kinda sucks.
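
As a sketch (untested, hypothetical name), that do-it-twice approach could look like this, with one xmm broadcast feeding both halves:

#include <immintrin.h>

// AVX1-only integer version: mask/compare twice in 128-bit halves, then combine.
__m256 inverse_movemask_avx1_integer(unsigned bitmap)
{
    __m128i bcast = _mm_set1_epi32(bitmap);
    const __m128i lo_sel = _mm_setr_epi32(1<<0, 1<<1, 1<<2, 1<<3);
    const __m128i hi_sel = _mm_setr_epi32(1<<4, 1<<5, 1<<6, 1<<7);
    __m128i lo = _mm_cmpeq_epi32(_mm_and_si128(bcast, lo_sel), lo_sel);
    __m128i hi = _mm_cmpeq_epi32(_mm_and_si128(bcast, hi_sel), hi_sel);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_castsi128_ps(lo)),
                                _mm_castsi128_ps(hi), 1);   // vinsertf128
}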

You might consider making an AVX2 version separate from your AVX1-only version, using vpbroadcastd ymm0, mem / vpand ymm0, mask / vpcmpeqd dst, ymm0, mask, because that's very efficient, especially if you're loading the bitmap from memory and you can read a whole dword for the bitmap. (Broadcast-loads of dword or qword don't need an ALU shuffle so it's worth overreading). The mask is set_epi32(1<<7, 1<<6, 1<<5, ..., 1<<0), which you can load with vpmovzxbd ymm, qword [constant] so it only takes 8 bytes of data memory for 8 elements.
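
A sketch of that AVX2 integer version (untested; the function name is mine):

#include <immintrin.h>

// AVX2: vpbroadcastd / vpand / vpcmpeqd.  Low bit -> element 0.
// A compiler may load bit_select with vpmovzxbd from an 8-byte constant.
__m256i inverse_movemask_epi32_avx2(unsigned bitmap)
{
    const __m256i bit_select = _mm256_setr_epi32(
        1<<0, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7);
    __m256i bcast  = _mm256_set1_epi32(bitmap);            // vpbroadcastd
    __m256i masked = _mm256_and_si256(bcast, bit_select);  // vpand
    return _mm256_cmpeq_epi32(masked, bit_select);         // vpcmpeqd
}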


Intrinsics version; see below for the explanation and the asm version. It compiles about how we'd expect on Godbolt with gcc/clang -march=sandybridge.

#include <immintrin.h>
// AVX2 can be significantly more efficient, doing this with integer SIMD
// Especially for the case where the bitmap is in an integer register, not memory
// It's fine if `bitmap` contains high garbage; make sure your C compiler broadcasts from a dword in memory if possible instead of integer load with zero extension. 
// e.g. __m256 _mm256_broadcast_ss(float *a);  or memcpy to unsigned. 
// Store/reload is not a bad strategy vs. movd + 2 shuffles so maybe just do it even if the value might be in a register; it will force some compilers to store/broadcast-load.  But it might not be type-punning safe  even though it's an intrinsic.

// Low bit -> element 0, etc.
__m256 inverse_movemask_ps_avx1(unsigned bitmap)
{
    // if you know DAZ is off: don't OR, just AND/CMPEQ with subnormal bit patterns
    // FTZ is irrelevant, we only use bitwise booleans and CMPPS
    const __m256 exponent = _mm256_set1_ps(1.0f);   // set1_epi32(0x3f800000)
    const __m256 bit_select = _mm256_castsi256_ps(
          _mm256_set_epi32(  // exponent + low significand bits
                0x3f800000 + (1<<7), 0x3f800000 + (1<<6),
                0x3f800000 + (1<<5), 0x3f800000 + (1<<4),
                0x3f800000 + (1<<3), 0x3f800000 + (1<<2),
                0x3f800000 + (1<<1), 0x3f800000 + (1<<0)
          ));

    // bitmap |= 0x3f800000;  // more efficient to do this scalar, but only if the data was in a register to start with
    __m256  bcast = _mm256_castsi256_ps(_mm256_set1_epi32(bitmap));
    __m256  ored  = _mm256_or_ps(bcast, exponent);
    __m256  isolated = _mm256_and_ps(ored, bit_select);
    return _mm256_cmp_ps(isolated, bit_select, _CMP_EQ_OQ);
}

If we get creative, we can use AVX1 FP instructions to do the same thing. AVX1 has dword broadcast (vbroadcastss ymm0, mem), and booleans (vandps). That will produce bit patterns that are valid single-precision floats so we could use vcmpeqps, but they're all denormals if we leave the bitmap bits in the bottom of the element. That might actually be fine on Sandybridge: there might be no penalty for comparing denormals. But it will break if your code ever runs with DAZ (denormals-are-zero), so we should avoid this.

We could vorps with something to set an exponent before or after masking, or we could shift the bitmap up into the 8-bit exponent field of the IEEE floating-point format. If your bitmap starts in an integer register, shifting it would be good, because shl eax, 23 before movd is cheap. But if it starts in memory, that means giving up on using a cheap vbroadcastss load. Or you could broadcast-load to xmm, vpslld xmm0, xmm0, 23 / vinsertf128 ymm0, ymm0, xmm0, 1. But that's still worse than vbroadcastss / vorps / vandps / vcmpeqps.

(Scalar OR before store/reload solves the same problem.)
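
A sketch of that shift-into-exponent variant for a bitmap that starts in an integer register (untested; the function name is mine):

#include <immintrin.h>

// shl by 23 moves bit i into the exponent field, so every masked element is
// either 0.0 or a normal (non-denormal) float: no OR with 1.0f needed, DAZ-safe.
__m256 inverse_movemask_shl23(unsigned bitmap)
{
    const __m256 bit_select = _mm256_castsi256_ps(_mm256_setr_epi32(
        1<<23, 1<<24, 1<<25, 1<<26, 1<<27, 1<<28, 1<<29, 1<<30));
    __m256 bcast    = _mm256_castsi256_ps(_mm256_set1_epi32(bitmap << 23));
    __m256 isolated = _mm256_and_ps(bcast, bit_select);
    return _mm256_cmp_ps(isolated, bit_select, _CMP_EQ_OQ);
}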

So:

# untested
# pointer to bitmap in rdi
inverse_movemask:
    vbroadcastss  ymm0, [rdi]

    vorps         ymm0, ymm0, [set_exponent]   ; or hoist this constant out with a broadcast-load

    vmovaps       ymm7, [bit_select]          ; hoist this out of any loop, too
    vandps        ymm0, ymm0, ymm7
    ; ymm0 exponent = 2^0, mantissa = 0 or 1<<i where i = element number
    vcmpeqps      ymm0, ymm0, ymm7
    ret

section .rodata
ALIGN 32
      ; low bit -> low element.  _mm_setr order
    bit_select: dd 0x3f800000 + (1<<0), 0x3f800000 + (1<<1)
                dd 0x3f800000 + (1<<2), 0x3f800000 + (1<<3)
                dd 0x3f800000 + (1<<4), 0x3f800000 + (1<<5)
                dd 0x3f800000 + (1<<6), 0x3f800000 + (1<<7)

    set_exponent: times 8 dd 0x3f800000    ; 1.0f
    ;  broadcast-load this instead of duplicating it in memory if you're hoisting it.

Instead of broadcast-loading set_exponent, you could instead shuffle bit_select: as long as the 0x3f800000 bits are set, it doesn't matter if element 0 also sets bit 3 or something, just not bit 0. So vpermilps or vshufps to copy-and-shuffle would work.
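
In intrinsics, that copy-and-shuffle could be sketched as (untested, hypothetical helper):

#include <immintrin.h>

// Derive the OR constant from bit_select instead of loading set_exponent:
// _MM_SHUFFLE(2,3,0,1) = 0xB1 swaps adjacent pairs within each lane, so
// element i gets the exponent bits + bit (i^1): never its own select bit.
__m256 or_constant_from_bit_select(__m256 bit_select)
{
    return _mm256_permute_ps(bit_select, _MM_SHUFFLE(2,3,0,1));  // vpermilps
}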

Or if the bitmap is in an integer register to start with, you can use scalar OR and avoid that vector constant. (And scalar OR runs on more ports.)

# alternate top of the function for input in an integer reg, not pointer.
    or     edi, 0x3f800000
    mov    [rsp-4], edi             ; red-zone
    vbroadcastss ymm0, [rsp-4]
    ;; skip the vorps

Store/reload might have similar latency to vmovd (1c) + vpshufd xmm (1c) + vinsertf128 (3c) = 5c total to broadcast from an integer register without AVX2 or AVX512 on Intel SnB-family. And it's fewer fused-domain uops (2 instead of 3), and doesn't hit the shuffle port (3 uops for p5 on SnB-family). Your choice might depend on whether there's load/store pressure or port-5 pressure in the surrounding code.

(SnB/IvB have integer-shuffle units on 2 ports; only FP shuffles are limited to 1. Haswell removed the shuffle units outside of p5. But unless you do dynamic dispatching to avoid using this on AVX2 CPUs, you might want to tune for newer CPUs while still maintaining compat with AVX1-only CPUs.)

If you were going to do an ALU broadcast with shuffles (like clang does), you could borrow clang's trick of doing a vorps xmm to save a uop on AMD CPUs that split 256-bit ops, and to allow a narrower OR constant. But that's pointless: either you had the value in an integer register (where you can use scalar or), or it was in memory where you should have used vbroadcastss ymm. I guess if tuning for AMD before Zen2 you might consider a broadcast XMM load, vpor xmm, then vinsertf128.
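
That pre-Zen2 AMD idea might look like this sketch (untested; name and signature are mine):

#include <immintrin.h>

// 128-bit broadcast-load + vpor xmm, then widen: friendlier to AMD CPUs that
// split 256-bit ops into two 128-bit halves.
__m256 broadcast_or_for_amd(const unsigned *bitmap_ptr)
{
    __m128i lo = _mm_castps_si128(_mm_broadcast_ss((const float *)bitmap_ptr));
    lo = _mm_or_si128(lo, _mm_set1_epi32(0x3f800000));             // vpor xmm
    __m128 f = _mm_castsi128_ps(lo);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(f), f, 1);  // vinsertf128
}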


https://www.h-schmidt.net/FloatConverter/IEEE754.html is a useful IEEE754 FP value <-> hex bit pattern converter, in case you want to check what value some FP bit pattern represents.

vcmpeqps has the same latency and throughput as vaddps on all Intel CPUs. (This is not a coincidence; they run on the same execution unit). That means 3 cycle latency on SnB-Broadwell, and 4 cycle latency on Skylake. But vpcmpeqd is only 1c latency.

So this method has good throughput (only 1 uop more than AVX2 integer, where vorps isn't needed), but worse latency by 3 cycles, or 4 on Skylake.


But isn't comparing floating point numbers dangerous or bad practice?

Comparison for exact equality can give unexpected results when one of the comparison inputs is the rounded result of a calculation (e.g. the output of vaddps or vmulps). Bruce Dawson's blog series on FP math in general and x86 in particular is excellent, specifically Comparing Floating Point Numbers, 2012 Edition. But in this case, we're controlling the FP bit-patterns, and there's no rounding.

Non-NaN FP values with the same bit-pattern will always compare equal.

FP values with different bit-patterns will always compare as not-equal, except for -0.0 and +0.0 (which differ in sign bit only), and denormalized values in DAZ mode. The latter is why we're using vorps; you can skip it if you know DAZ is disabled and your FP hardware doesn't require an assist for comparison of denormals. (IIRC, Sandybridge doesn't, and can even add / sub denormals without an assist. When microcode assists are needed on Intel hardware, it's usually when producing a denormal result from normal inputs, but compares don't produce an FP result.)

answered on Stack Overflow Feb 15, 2018 by Peter Cordes • edited Oct 10, 2019 by Peter Cordes

Preface: I know that this doesn't fulfill the (whole) requirements of the question, so this isn't an acceptable answer. I'm just posting it for future reference.

There is a new AVX512(VL|DQ) instruction named VPMOVM2D which does what you want in exactly one instruction:

VPMOVM2D ymm1, k1

Sets each doubleword in YMM1 to all 1's or all 0's based on the value of the corresponding bit in k1.

I couldn't test it, but it should be what you want.
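
With intrinsics, that's a one-liner (untested sketch; requires AVX512VL+DQ, and the function name is mine):

#include <immintrin.h>

__m256i inverse_movemask_avx512(unsigned bitmap)
{
    return _mm256_movm_epi32((__mmask8)bitmap);   // vpmovm2d ymm, k
}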

answered on Stack Overflow Feb 15, 2018 by zx485 • edited Feb 15, 2018 by zx485

User contributions licensed under CC BY-SA 3.0