How to write LOP3-based instructions for Maxwell and later NVIDIA architectures?

Question

The Maxwell architecture introduced a new PTX instruction called LOP3 which, according to the NVIDIA blog:

"Can save instructions when performing complex logic operations on multiple inputs."

At GTC 2016, some CUDA developers managed to accelerate the atan2f function for the Tegra X1 processor (Maxwell) with such instructions.

However, the function below, defined in a .cu file, fails to compile because __SET_LT, __LOP3_0x2e and __LOP3_0xe2 are undefined.

Do I have to define them in a .ptx file instead? If so, how?

float atan2f(const float dy, const float dx) 
{
 float flag, z = 0.0f;
 __SET_LT(flag, fabsf(dy), fabsf(dx));

 uint32_t m, t1 = 0x80000000; 
 float t2 = float(M_PI) / 2.0f;

 __LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
 float w = flag * __int_as_float(m) + float(M_PI)/2.0f; 

 float Offset = copysignf(w, dy);
 float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));

 uint32_t r, b = __float_as_int(flag) << 2;
 uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
 __LOP3_0xe2(r, mask, t1, __float_as_int(t));

 const float p = fabsf(__int_as_float(r)) - 1.0f;
 return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
}

Edit:

The macro definitions I ended up with are:

#define __SET_LT(D, A, B) asm("set.lt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __SET_GT(D, A, B) asm("set.gt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __LOP3_0x2e(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0x2e;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
#define __LOP3_0xe2(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0xe2;" : "=r"(D) : "r"(A), "r"(B), "r"(C))

cuda

nvidia

asked on Stack Overflow May 10, 2016 by (unknown user) • edited May 15, 2016 by (unknown user)

1 Answer

The lop3.b32 PTX instruction can perform an arbitrary bitwise boolean (logical) operation on three 32-bit inputs A, B, and C.

To select the operation to be performed, we must provide a "lookup-table" immediate argument (immLut, an 8-bit quantity). As indicated in the documentation, one way to compute the immLut value for a given operation F(A,B,C) is to substitute 0xF0 for A, 0xCC for B, and 0xAA for C in the desired equation. For example, suppose we want to compute:

F = (A || B) && (!C)   ((A or B) and (not-C))

Then we would compute immLut argument by:

immLut = (0xF0 | 0xCC) & (~0xAA)

Note that the equation for F above is written as a boolean equation, treating the arguments A, B, and C as boolean values and producing a true/false result (F). The equation used to compute immLut, however, is a bitwise logical operation. (lop3.b32 itself applies F independently to each of the 32 bit positions of its operands.)

For the above example, immLut would have a computed value of 0x54.
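As a quick check of the substitution rule, here is a minimal host-side C++ sketch that evaluates the same expression at compile time. The names kA, kB, kC and imm_lut_A_or_B_and_notC are made up for illustration and are not part of any CUDA API; the snippet should also compile unchanged inside a .cu file.

```cpp
#include <cstdint>

// Placeholder constants used for the substitution: each one is the
// truth-table column of the corresponding input.
constexpr std::uint8_t kA = 0xF0;
constexpr std::uint8_t kB = 0xCC;
constexpr std::uint8_t kC = 0xAA;

// F = (A or B) and (not C), written with bitwise operators on the
// placeholders. ~kC is computed as an int, so the cast truncates the
// result back to the 8 bits that immLut actually has.
constexpr std::uint8_t imm_lut_A_or_B_and_notC =
    static_cast<std::uint8_t>((kA | kB) & ~kC);

// Fails to compile if the substitution does not give 0x54.
static_assert(imm_lut_A_or_B_and_notC == 0x54, "immLut for (A|B)&~C");
```

Any other three-input expression can be handled the same way by changing the right-hand side.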

If it's desired to use a PTX instruction in ordinary CUDA C/C++ code, probably the most common (and arguably easiest) method would be to use inline PTX. Inline PTX is documented, and there are other questions discussing how to use it (such as this one), so I'll not repeat that here.

Here is a worked example of the above case. Note that this particular PTX instruction is only available on compute capability 5.0 (cc5.0, Maxwell) and higher architectures, so be sure to compile for at least that target.

$ cat t1149.cu
#include <stdio.h>

const unsigned char A_or_B_and_notC=((0xF0|0xCC)&(~0xAA));

__device__ int my_LOP_0x54(int A, int B, int C){
  int temp;
  asm("lop3.b32 %0, %1, %2, %3, 0x54;" : "=r"(temp) : "r"(A), "r"(B), "r"(C));
  return temp;
}

__global__ void testkernel(){

  printf("A=true, B=false, C=true,   F=%d\n", my_LOP_0x54(true, false, true));
  printf("A=true, B=false, C=false,  F=%d\n", my_LOP_0x54(true, false, false));
  printf("A=false, B=false, C=false, F=%d\n", my_LOP_0x54(false, false, false));
}


int main(){

  printf("0x%x\n", A_or_B_and_notC);
  testkernel<<<1,1>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_50 -o t1149 t1149.cu
$ ./t1149
0x54
A=true, B=false, C=true,   F=0
A=true, B=false, C=false,  F=1
A=false, B=false, C=false, F=0
$
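The kernel above only exercises bit 0 of each operand. To see what the instruction would do with full 32-bit words, it can help to have a host-side reference model. The following is a sketch under the assumption that lop3.b32 applies the 8-bit table independently to every bit position; lop3_ref is a hypothetical helper name, not an NVIDIA function, and the loop inside a constexpr function needs C++14 or later.

```cpp
#include <cstdint>

// Host-side reference model of lop3.b32 (illustrative sketch only).
// Bit i of imm_lut is the output for the input combination
// i = (a << 2) | (b << 1) | c, applied to each of the 32 bit positions.
constexpr std::uint32_t lop3_ref(std::uint32_t a, std::uint32_t b,
                                 std::uint32_t c, std::uint8_t imm_lut) {
    std::uint32_t result = 0;
    for (unsigned i = 0; i < 8; ++i) {
        if ((imm_lut >> i) & 1u) {
            // Masks of the bit positions whose (a, b, c) bits equal combination i.
            const std::uint32_t ma = (i & 4u) ? a : ~a;
            const std::uint32_t mb = (i & 2u) ? b : ~b;
            const std::uint32_t mc = (i & 1u) ? c : ~c;
            result |= ma & mb & mc;
        }
    }
    return result;
}

// With immLut = 0x54 the model agrees with (A | B) & ~C on whole words.
static_assert(lop3_ref(0xDEADBEEFu, 0x12345678u, 0x0F0F0F0Fu, 0x54) ==
                  ((0xDEADBEEFu | 0x12345678u) & ~0x0F0F0F0Fu),
              "0x54 should behave as (A|B)&~C");
```

Calling lop3_ref(1, 0, 0, 0x54) gives 1, and lop3_ref(1, 0, 1, 0x54) gives 0, which matches the kernel output shown above.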

Since immLut is an immediate constant in PTX code, I know of no way using inline PTX to pass this as a function parameter, even if templating is used. Based on your provided link, it seems that the authors of that presentation also used a separately defined function for each specific immediate value -- presumably 0xE2 and 0x2E in their case. Also, note that I have chosen to write my function so that it returns the result of the operation as the function return value. The authors of the presentation you linked appear to be passing the result back via a function parameter. Either method should be workable. (In fact, it appears they have written their __LOP3... codes as function-like macros rather than ordinary functions.)
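For what it's worth, the two constants can be decoded with the same substitution rule. The sketch below is my own reading of 0xE2 and 0x2E, not something stated in the presentation: both look like bit selects controlled by B. The helper names lop3_0xe2_ref and lop3_0x2e_ref are hypothetical host-side stand-ins, and the float bit patterns in the comments assume IEEE-754 single precision.

```cpp
#include <cstdint>

constexpr std::uint8_t kA = 0xF0, kB = 0xCC, kC = 0xAA;

// 0xE2: take A where B is set, otherwise take C.
static_assert(static_cast<std::uint8_t>((kA & kB) | (~kB & kC)) == 0xE2,
              "0xE2 decodes as B ? A : C");
// 0x2E: take NOT A where B is set, otherwise take C.
static_assert(static_cast<std::uint8_t>((~kA & kB) | (~kB & kC)) == 0x2E,
              "0x2E decodes as B ? ~A : C");

// Plain C++ equivalents of the two operations (host-side reference only).
constexpr std::uint32_t lop3_0xe2_ref(std::uint32_t a, std::uint32_t b,
                                      std::uint32_t c) {
    return (a & b) | (~b & c);
}
constexpr std::uint32_t lop3_0x2e_ref(std::uint32_t a, std::uint32_t b,
                                      std::uint32_t c) {
    return (~a & b) | (~b & c);
}

// With B = 0x80000000 (the sign bit), 0x2E copies the inverted sign of A
// onto C: A = bits of 1.0f (0x3F800000), C = bits of pi/2 (0x3FC90FDB)
// gives the bits of -pi/2 (0xBFC90FDB).
static_assert(lop3_0x2e_ref(0x3F800000u, 0x80000000u, 0x3FC90FDBu) ==
                  0xBFC90FDBu,
              "sign of C becomes the opposite of the sign of A");
```

Read this way, both calls in the question's atan2f with t1 = 0x80000000 transfer a sign bit onto a magnitude in a single instruction, which would be consistent with the claimed instruction savings.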

Also see here for a method of understanding how the 8-bit truth table (immLut) works for LOP3 at the source code level.
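One simple way to look at the truth table in source code is to index immLut by the three input bits. The following is a minimal sketch; lut_bit and truth_table are hypothetical helper names, and the bit ordering (A as the most significant index bit, C as the least) is the one implied by the 0xF0 / 0xCC / 0xAA placeholders.

```cpp
#include <cstdint>
#include <string>

// Output bit selected by one (a, b, c) combination.
// The index is (a << 2) | (b << 1) | c.
constexpr unsigned lut_bit(std::uint8_t imm_lut, bool a, bool b, bool c) {
    return (imm_lut >> ((a ? 4u : 0u) | (b ? 2u : 0u) | (c ? 1u : 0u))) & 1u;
}

// Render all eight rows as text, one per line, e.g. "A=1 B=0 C=0 -> F=1".
inline std::string truth_table(std::uint8_t imm_lut) {
    std::string out;
    for (unsigned i = 0; i < 8; ++i) {
        const bool a = (i & 4u) != 0;
        const bool b = (i & 2u) != 0;
        const bool c = (i & 1u) != 0;
        out += "A=" + std::to_string(a) + " B=" + std::to_string(b) +
               " C=" + std::to_string(c) + " -> F=" +
               std::to_string(lut_bit(imm_lut, a, b, c)) + "\n";
    }
    return out;
}
```

For immLut = 0x54, the only rows with F=1 are (A=0,B=1,C=0), (A=1,B=0,C=0) and (A=1,B=1,C=0), which is exactly (A or B) and (not C).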

answered on Stack Overflow May 13, 2016 by Robert Crovella • edited Jan 18, 2020 by Robert Crovella

User contributions licensed under CC BY-SA 3.0