How to clear all but the first non-zero lane in neon?


I have a mask in a uint32x4_t NEON register. At least one of the four 32-bit lanes in this mask is set (e.g. 0xffffffff), but there may be cases where more than one lane is set. How can I ensure that only one is set?

In C pseudo-code:

uint32x4_t clearmask(uint32x4_t m)
{
         if (m[0]) { m[1] = m[2] = m[3] = 0; }
    else if (m[1]) { m[2] = m[3] = 0; }
    else if (m[2]) { m[3] = 0; }
    return m;
}

Basically I want to clear all but one of the set lanes. An obvious, straightforward implementation in NEON could be:

uint32x4_t cleanmask(uint32x4_t m)
{
    uint32x4_t mx;
    // Broadcast ~m[0] to every lane, then force lane 0 back to all-ones:
    // if lane 0 was set, this clears lanes 1..3; otherwise m is unchanged.
    mx = vdupq_lane_u32(vget_low_u32(vmvnq_u32(m)), 0);
    mx = vsetq_lane_u32(0xffffffff, mx, 0);
    m = vandq_u32(m, mx);

    // Same idea for lane 1: clear lanes 2..3 if lane 1 is still set.
    mx = vdupq_lane_u32(vget_low_u32(vmvnq_u32(m)), 1);
    mx = vsetq_lane_u32(0xffffffff, mx, 1);
    m = vandq_u32(m, mx);

    // And for lane 2: clear lane 3 if lane 2 is still set.
    mx = vdupq_lane_u32(vget_high_u32(vmvnq_u32(m)), 0);
    mx = vsetq_lane_u32(0xffffffff, mx, 2);
    m = vandq_u32(m, mx);

    return m;
}

How can this be done more efficiently in ARM NEON?

Tags: c++, arm, intrinsics, neon
asked on Stack Overflow Jul 8, 2018 by Pavel • edited Jul 8, 2018 by Peter Cordes

2 Answers


Very simple:

vceq.u32    q1, q0, #0
vmov.i8     d7, #0xff
vext.8      q2, q3, q1, #12

vand        q0, q0, q2
vand        d1, d1, d2
vand        d1, d1, d4

6 instructions total, 5 if you can keep q3 as a constant.

The AArch64 version below should be easier to understand:

cmeq    v1.4s, v0.4s, #0
movi    v31.16b, #0xff

ext     v2.16b, v31.16b, v1.16b, #12
ext     v3.16b, v31.16b, v1.16b, #8
ext     v4.16b, v31.16b, v1.16b, #4

and     v0.16b, v0.16b, v2.16b
and     v0.16b, v0.16b, v3.16b
and     v0.16b, v0.16b, v4.16b

How this works

ext/vext extracts a window from the concatenation of two vectors, so we're creating the masks:

v0 = [  d   c   b   a ]

v2 = [ !c  !b  !a  -1 ]
v3 = [ !b  !a  -1  -1 ]
v4 = [ !a  -1  -1  -1 ]

The highest element (d) is zeroed if any of the previous elements are non-zero.

The 2nd highest element (c) is zeroed if any of its preceding elements (a or b) are non-zero. And so on.


With elements guaranteed to be 0 or -1, mvn also works instead of a compare against zero.
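
An intrinsics version of the same idea might look like this (only a sketch; cleanmask_ext is an arbitrary name, and a compiler may or may not emit exactly the sequence above). It uses a compare against zero like the asm, so it also handles arbitrary non-zero lane values:

#include <arm_neon.h>

uint32x4_t cleanmask_ext(uint32x4_t m)
{
    uint32x4_t z    = vceqq_u32(m, vdupq_n_u32(0)); // lane == 0 ? -1 : 0
    uint32x4_t ones = vdupq_n_u32(~0U);             // all-ones padding

    // Shift the "is zero" flags up by 1, 2 and 3 lanes, padding with -1:
    uint32x4_t m1 = vextq_u32(ones, z, 3);  // v2 in the diagram: [ !c !b !a -1 ]
    uint32x4_t m2 = vextq_u32(ones, z, 2);  // v3: [ !b !a -1 -1 ]
    uint32x4_t m3 = vextq_u32(ones, z, 1);  // v4: [ !a -1 -1 -1 ]

    m = vandq_u32(m, m1);
    m = vandq_u32(m, m2);
    m = vandq_u32(m, m3);
    return m;
}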

answered on Stack Overflow Jul 8, 2018 by Jake 'Alquimista' LEE • edited Jul 8, 2018 by Peter Cordes

I had nearly the same idea as your uncommented code: broadcast an inverted element as an AND mask to zero the later elements if that element is set, otherwise leave the vector unmodified.

But if you're using this in a loop and have 3 spare vector registers, you can NOT all but one element with XOR, instead of MVN + set one element.

vdupq_lane_u32(vget_low_u32(m), 1) appears to compile efficiently, to a single vdup.32 q9, d16[1], and that part of my code is the same as yours (but without the mvn).

Unfortunately this is a long serial dependency chain; we're creating the next mask from the AND result, so there's no ILP. I don't see a good way to make this lower latency while still getting the desired result.

uint32x4_t cleanmask_xor(uint32x4_t m)
{
    //                 {  a    b    c   d }
    uint32x4_t maska = {  0, ~0U, ~0U, ~0U};
    uint32x4_t maskb = {~0U,   0, ~0U, ~0U};
    uint32x4_t maskc = {~0U, ~0U,   0, ~0U};

    uint32x4_t tmp = vdupq_lane_u32(vget_low_u32(m), 0);
    uint32x4_t aflip = tmp ^ maska;
    m &= aflip;  // if a was non-zero, the rest are zero

    tmp = vdupq_lane_u32(vget_low_u32(m), 1);
    uint32x4_t bflip = tmp ^ maskb;
    m &= bflip;  // if b was non-zero, the rest are zero

    tmp = vdupq_lane_u32(vget_high_u32(m), 0);
    uint32x4_t cflip = tmp ^ maskc;
    m &= cflip;  // if c was non-zero, the rest are zero

    return m;
}

(Godbolt)

/* design notes
  [ a   b   c   d ]
  [ a  ~a  ~a  ~a ] 

&:[ a   0   0   0 ]
or[ 0   b   c   d ]

= [ e   f   g   h  ]
  [ ~f  f   ~f  ~f ]  // not b, because f can be zero when b isn't

= [ i   j   k   l ]
  ...
*/

With the loads hoisted out of a loop, this is only 9 instructions vs. 12, because we skip the vmov.32 d1[0], r3 or whatever to insert a -1 in each mask. (ANDing an element with itself is equivalent to ANDing with -1U.) veor with all-ones in the other elements replaces vmvn.
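
As a rough sketch of what that hoisting could look like (cleanmask_xor_hoisted and clean_all are just illustrative names, not part of the original code), the three constants are materialized once and each iteration is left with the 3x dup + 3x eor + 3x and:

#include <arm_neon.h>
#include <stddef.h>

static inline uint32x4_t cleanmask_xor_hoisted(uint32x4_t m, uint32x4_t maska,
                                               uint32x4_t maskb, uint32x4_t maskc)
{
    m = vandq_u32(m, veorq_u32(vdupq_lane_u32(vget_low_u32(m), 0), maska));
    m = vandq_u32(m, veorq_u32(vdupq_lane_u32(vget_low_u32(m), 1), maskb));
    m = vandq_u32(m, veorq_u32(vdupq_lane_u32(vget_high_u32(m), 0), maskc));
    return m;
}

void clean_all(uint32x4_t *v, size_t n)
{
    // Constants loaded once, outside the loop.
    const uint32x4_t maska = {  0, ~0U, ~0U, ~0U};
    const uint32x4_t maskb = {~0U,   0, ~0U, ~0U};
    const uint32x4_t maskc = {~0U, ~0U,   0, ~0U};
    for (size_t i = 0; i < n; i++)
        v[i] = cleanmask_xor_hoisted(v[i], maska, maskb, maskc);
}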

clang seems to be inefficient at loading multiple vector constants: it sets up each address separately instead of just storing them near each other where it can reach them all from one base pointer. So you might want to consider alternate strategies for creating the 3 constants; see the sketch after the snippet below.

#if 1
    // clang sets up the address of each constant separately
    //                 {  a    b    c   d }
    uint32x4_t maska = {  0, ~0U, ~0U, ~0U};
    uint32x4_t maskb = {~0U,   0, ~0U, ~0U};
    uint32x4_t maskc = {~0U, ~0U,   0, ~0U};
#else
    static const uint32_t maskbuf[] = 
      { -1U, -1U, 0, -1U, -1U, -1U};
    // unaligned loads.
    // or load one + shuffle?
#endif
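
A sketch of how that buffer variant might be completed (again just an illustration, not tested against any particular compiler): three overlapping, possibly unaligned vld1q_u32 loads from the 6-element table give the three masks, each with its single 0 in the right lane.

#include <arm_neon.h>

static const uint32_t maskbuf[6] = { ~0U, ~0U, 0, ~0U, ~0U, ~0U };

static inline void load_masks(uint32x4_t *maska, uint32x4_t *maskb,
                              uint32x4_t *maskc)
{
    *maska = vld1q_u32(&maskbuf[2]);  // {  0, ~0, ~0, ~0 }
    *maskb = vld1q_u32(&maskbuf[1]);  // { ~0,  0, ~0, ~0 }
    *maskc = vld1q_u32(&maskbuf[0]);  // { ~0, ~0,  0, ~0 }
}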
answered on Stack Overflow Jul 8, 2018 by Peter Cordes

User contributions licensed under CC BY-SA 3.0