Is there a way to help the auto-vectorizing compiler emit saturating arithmetic instructions in LLVM?


I have a few for loops that perform saturating arithmetic operations. For instance, my implementation of saturating add is as follows:

    static void addsat(Vector &R, Vector &A, Vector &B)
    {
        int32_t a, b, r;
        int32_t max_add;
        int32_t min_add;

        const int32_t SAT_VALUE = (1<<(16-1))-1;
        const int32_t SAT_VALUE2 = (-SAT_VALUE - 1);
        const int32_t sat_cond = (SAT_VALUE <= 0x7fffffff);
        const uint32_t SAT = 0xffffffff >> 16;

        for (int i=0; i<R.length; i++)
        {
            a = static_cast<uint32_t>(A.data[i]);
            b = static_cast<uint32_t>(B.data[i]);

            max_add = (int32_t)0x7fffffff - a;
            min_add = (int32_t)0x80000000 - a;

            r = (a>0 && b>max_add) ? 0x7fffffff : a + b;
            r = (a<0 && b<min_add) ? 0x80000000 : a + b;

            if (sat_cond == 1)
            {
                std_max(r,r,SAT_VALUE2);
                std_min(r,r,SAT_VALUE);
            }
            else
            {
                r = static_cast<uint16_t> (static_cast<int32_t> (r));
            }

            R.data[i] = static_cast<uint16_t>(r);
        }
    }

I see that x86 has packed saturating-add instructions (PADDSW/PADDUSW and their AVX2 forms) that would be the perfect fit for this loop. The code does get auto-vectorized, but as a combination of multiple operations that mirrors my source rather than as a single saturating add. What is the best way to write this loop so that the auto-vectorizer recognizes it as a saturating add?
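To make the target pattern concrete, here is a minimal sketch of the form a vectorizer has the best chance of recognizing: widen to 32 bits, add, clamp to the 16-bit range, narrow. It assumes signed 16-bit elements. The names addsat_i16 and addsat_loop are hypothetical and not part of my code. Newer LLVM releases have dedicated saturating-add intrinsics (llvm.sadd.sat, llvm.uadd.sat) that a pattern like this may be folded into, but I cannot confirm that clang 3.8 does so.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: signed 16-bit saturating add.
// The sum of two int16_t values always fits in int32_t, so no overflow
// checks are needed before the clamp.
static inline int16_t addsat_i16(int16_t a, int16_t b)
{
    int32_t sum = static_cast<int32_t>(a) + static_cast<int32_t>(b);
    sum = std::min<int32_t>(std::max<int32_t>(sum, INT16_MIN), INT16_MAX);
    return static_cast<int16_t>(sum);
}

// Hypothetical loop wrapper: one branch-free clamp per element, which is
// the shape a vectorizer can map onto a packed saturating add.
static void addsat_loop(int16_t *r, const int16_t *a, const int16_t *b,
                        std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        r[i] = addsat_i16(a[i], b[i]);
}
```

Compared with my version, this drops the 32-bit overflow checks entirely, since they can never trigger for 16-bit inputs and only add operations the vectorizer has to see through.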

The Vector structure is:

    struct V {
      static constexpr int length = 32;
      unsigned short data[32];
    };
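Since the struct stores unsigned short, the unsigned variant may be the closer match. The following is a sketch under the assumption that unsigned 16-bit saturation (clamping at 0xFFFF) is acceptable, which differs from the signed clamp in my loop above. The function name addsat_u16 is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

struct V {
    static constexpr int length = 32;
    unsigned short data[32];
};

// Hypothetical unsigned variant: widen, add, clamp at 0xFFFF, narrow.
// Assumes unsigned short is 16 bits wide, as it is on x86-64.
static void addsat_u16(V &r, const V &a, const V &b)
{
    for (int i = 0; i < V::length; ++i) {
        uint32_t sum = static_cast<uint32_t>(a.data[i]) + b.data[i];
        r.data[i] = static_cast<unsigned short>(std::min<uint32_t>(sum, 0xFFFFu));
    }
}
```

Only an upper clamp is needed here because the sum of two unsigned values cannot go below zero.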

The compiler is clang 3.8, and the code was compiled for the x86-64 Haswell architecture with AVX2.

performance
llvm
vectorization
avx2
auto-vectorization
asked on Stack Overflow Jun 13, 2016 by jumanji • edited Oct 25, 2017 by Christoph

0 Answers



User contributions licensed under CC BY-SA 3.0