How to convince nvcc to use 128-bit wide loads?

I have a kernel that needs to apply a stencil operation on an array and store the result on another array. The stencil could be expressed in a function as:

float stencil(const float* data)
{
    return *(data-1) + *(data+1);
}

I want every thread to produce 4 contiguous values of the output array by loading 6 contiguous values of the input array. That way I should be able to use the float4 type to load and store in 128-bit chunks. This is my program (you can download and compile it, but please consider the kernel first):

#include <iostream>
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>

__global__ void kernel(const float* input, float* output, int size)
{
    int i = 4*(blockDim.x*blockIdx.x + threadIdx.x);
    float values[6];
    float res[4];

    // Load values
    values[0] = *(input+i-1);
    *reinterpret_cast<float4*>(values+1) = *reinterpret_cast<const float4*>(input+i);
    values[5] = *(input+i+4);

    // Compute result
    res[0] = values[0]+values[2];
    res[1] = values[1]+values[3];
    res[2] = values[2]+values[4];
    res[3] = values[3]+values[5];

    // Store result
    *reinterpret_cast<float4*>(output+i) = *reinterpret_cast<const float4*>(res);
}

int main()
{
    // Parameters
    const int nBlocks = 8;
    const int nThreads = 128;
    const int nValues = 4 * nThreads * nBlocks;

    // Allocate host and device memory
    thrust::host_vector<float> input_host(nValues+64);
    thrust::device_vector<float> input(nValues+64), output(nValues);

    // Generate random input
    srand48(42);
    thrust::generate(input_host.begin(), input_host.end(), []{ return drand48()+1.; });
    input = input_host;

    // Run kernel
    kernel<<<nBlocks, nThreads>>>(thrust::raw_pointer_cast(input.data()+32), thrust::raw_pointer_cast(output.data()), nValues);

    // Check output
    for (int i = 0; i < nValues; ++i)
    {
        float ref = input_host[31+i] + input_host[33+i];

        if (ref != output[i])
        {
            std::cout << "Error at " << i << " : " << ref << "  " << output[i] << "\n";
            std::cout << "Abort with errors\n";
            std::exit(1);
        }
    }

    std::cout << "Success\n";
}

The program works perfectly.

I would expect the compiler to generate one LD.E.128 instruction for the central part of the local array values, with the destination registers contiguous (e.g. R4, R5, R6, R7); two LD.E instructions for the two ends of values; and one ST.E.128 for the output array.

What happens in reality is the following:

code for sm_21
    Function : _Z6kernelPKfPfi

    /*0000*/         MOV R1, c[0x1][0x100];                /* 0x2800440400005de4 */
    /*0008*/         NOP;                                  /* 0x4000000000001de4 */
    /*0010*/         MOV32I R3, 0x4;                       /* 0x180000001000dde2 */
    /*0018*/         S2R R0, SR_CTAID.X;                   /* 0x2c00000094001c04 */
    /*0020*/         S2R R2, SR_TID.X;                     /* 0x2c00000084009c04 */
    /*0028*/         IMAD R0, R0, c[0x0][0x8], R2;         /* 0x2004400020001ca3 */
    /*0030*/         SHL R6, R0, 0x2;                      /* 0x6000c00008019c03 */
    /*0038*/         IMAD R10.CC, R6, R3, c[0x0][0x20];    /* 0x2007800080629ca3 */
    /*0040*/         IMAD.HI.X R11, R6, R3, c[0x0][0x24];  /* 0x208680009062dce3 */
    /*0048*/         IMAD R2.CC, R6, R3, c[0x0][0x28];     /* 0x20078000a0609ca3 */
    /*0050*/         LD.E R4, [R10+0xc];                   /* 0x8400000030a11c85 */
    /*0058*/         IMAD.HI.X R3, R6, R3, c[0x0][0x2c];   /* 0x20868000b060dce3 */
    /*0060*/         LD.E R7, [R10+0x4];                   /* 0x8400000010a1dc85 */
    /*0068*/         LD.E R9, [R10+-0x4];                  /* 0x87fffffff0a25c85 */
    /*0070*/         LD.E R5, [R10+0x8];                   /* 0x8400000020a15c85 */
    /*0078*/         LD.E R0, [R10+0x10];                  /* 0x8400000040a01c85 */
    /*0080*/         LD.E R8, [R10];                       /* 0x8400000000a21c85 */
    /*0088*/         FADD R6, R7, R4;                      /* 0x5000000010719c00 */
    /*0090*/         FADD R4, R9, R7;                      /* 0x500000001c911c00 */
    /*0098*/         FADD R7, R5, R0;                      /* 0x500000000051dc00 */
    /*00a0*/         FADD R5, R8, R5;                      /* 0x5000000014815c00 */
    /*00a8*/         ST.E.128 [R2], R4;                    /* 0x9400000000211cc5 */
    /*00b0*/         EXIT;                                 /* 0x8000000000001de7 */
    ................................

All loads are 32-bit wide (LD.E). On the other hand, there is just one store instruction, ST.E.128, as expected.

I don't show the whole code here again, but I ran a test where the stencil needs no value to the left, only one to the right (i.e. *data + *(data+1)). In that case the values array contains just 5 entries, the float4 load fills its first 4 entries (with one extra scalar load for the last value), and the compiler does use LD.E.128.
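For reference, that variant looked roughly like this (a reconstructed sketch, not the exact code I ran; kernel_right is just an illustrative name):

__global__ void kernel_right(const float* input, float* output, int size)
{
    int i = 4*(blockDim.x*blockIdx.x + threadIdx.x);
    float values[5];
    float res[4];

    // The vector load now targets the first element of the local array
    *reinterpret_cast<float4*>(values) = *reinterpret_cast<const float4*>(input+i);
    values[4] = *(input+i+4);

    res[0] = values[0]+values[1];
    res[1] = values[1]+values[2];
    res[2] = values[2]+values[3];
    res[3] = values[3]+values[4];

    *reinterpret_cast<float4*>(output+i) = *reinterpret_cast<const float4*>(res);
}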

My question is: why doesn't the compiler understand that it can use the 128-bit wide read when the target register is not the first one of the local array? After all, the local array values is just a programming device to say that I need 6 floats kept in registers; there is no such thing as an array in the resulting PTX or SASS code. I thought I had given the compiler enough hints to understand that LD.E.128 was the right instruction here.

Second question: how can I make the compiler use the 128-bit wide load here without having to write low-level code by hand? (That said, if a couple of asm instructions would help, I'm open to suggestions.)
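For instance, one explicit low-level fallback would be an inline-PTX helper along these lines (a sketch: it assumes ptr points to global memory and is 16-byte aligned; load_float4 is a hypothetical name):

// Hypothetical helper: force a 128-bit load via inline PTX.
// Undefined behaviour if ptr is not 16-byte aligned.
__device__ __forceinline__ float4 load_float4(const float* ptr)
{
    float4 v;
    asm("ld.global.v4.f32 {%0, %1, %2, %3}, [%4];"
        : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
        : "l"(ptr));
    return v;
}

The central load in the kernel above would then become *reinterpret_cast<float4*>(values+1) = load_float4(input+i);.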

Side note: the decision to use 32-bit loads for reading the input and a 128-bit store for writing the output is taken while producing the PTX code; the PTX already shows this pattern of multiple small loads and a single large store.

I am using CUDA 7.5 under Linux.
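For reference, the SASS listings in this post can be reproduced with commands along these lines (stencil.cu is a placeholder file name):

nvcc -arch=sm_21 -o stencil stencil.cu
cuobjdump -sass stencil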


Based on the suggestions given in the comments, I did some experiments.

Declaring either input or output (or both) as __restrict__ solves the problem: when generating code for sm_35 the compiler emits one LD.E.128 and two LD.E, which is exactly what I wanted to achieve. Strangely enough, when targeting sm_21 it still produces six LD.E, although it does emit one ST.E.128. This looks like a compiler bug to me, because LD.E.128 should be just as usable on the older architecture as on the newer one.
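Concretely, the only change needed in the kernel above is the signature:

__global__ void kernel(const float* __restrict__ input, float* __restrict__ output, int size)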

With the small change of adding the __restrict__ keyword, as suggested by njuffa, the code presented above uses the 128-bit loads and works. I also followed the suggestion of m.s. and reproduced the results shown in his pastebin snippet (one LD.E.128 plus one LD.E.64), but at runtime it crashes with the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  an illegal memory access was encountered

I'm pretty sure misalignment is the cause of this problem.

Update: after running cuda-memcheck I'm sure the problem is misalignment:

========= Invalid __global__ read of size 16
=========     at 0x00000060 in kernel(float const *, float*, int)
=========     by thread (4,0,0) in block (7,0,0)
=========     Address 0xb043638bc is misaligned
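Indeed, 0xb043638bc is not a multiple of 16, and a 16-byte vector access requires a 16-byte aligned address. A minimal host-side check (a sketch, not part of the original program) makes the requirement explicit:

#include <cstdint>

// A float4 load/store is only valid on a 16-byte aligned address.
inline bool is_float4_aligned(const float* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;  // 16 == sizeof(float4)
}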
Tags: vector, cuda, bandwidth, nvcc
asked on Stack Overflow Sep 25, 2015 by Spiros • edited Sep 27, 2015 by Spiros

1 Answer

The problem is that the nvcc compiler is unable to resolve the base address for the vector load in your kernel. This may be a bug, or simply an inadequacy of the compiler.

I modified your code a little bit:

  __global__ void kernel2(const float* input, float* output, int size)
  {
      int i = (blockDim.x*blockIdx.x + threadIdx.x);
      float values[6];
      float res[4];

      // Load values
      values[0] = *(input+(i*4)-1);
      float4 test = *(reinterpret_cast<const float4*>(input)+i);
      values[5] = *(input+(i*4)+4);
      values[1] = test.x;
      values[2] = test.y;
      values[3] = test.z;
      values[4] = test.w;
      // Compute result
      res[0] = values[0]+values[2];
      res[1] = values[1]+values[3];
      res[2] = values[2]+values[4];
      res[3] = values[3]+values[5];

      // Store result
      *(reinterpret_cast<float4*>(output)+i) = *reinterpret_cast<const float4*>(res);
  }

This kernel compiles to the following PTX:

  .visible .entry _Z7kernel2PKfPfi(
          .param .u64 _Z7kernel2PKfPfi_param_0,
          .param .u64 _Z7kernel2PKfPfi_param_1,
          .param .u32 _Z7kernel2PKfPfi_param_2
  )
  {
          .reg .f32       %f<15>;
          .reg .b32       %r<7>;
          .reg .b64       %rd<10>;
          ld.param.u64    %rd1, [_Z7kernel2PKfPfi_param_0];
          ld.param.u64    %rd2, [_Z7kernel2PKfPfi_param_1];
          mov.u32         %r1, %ntid.x;
          mov.u32         %r2, %ctaid.x;
          mov.u32         %r3, %tid.x;
          mad.lo.s32      %r4, %r2, %r1, %r3;
          shl.b32         %r5, %r4, 2;
          add.s32         %r6, %r5, -1;
          mul.wide.s32    %rd3, %r6, 4;
          cvta.to.global.u64      %rd4, %rd1;
          add.s64         %rd5, %rd4, %rd3;
          ld.global.f32   %f1, [%rd5];
          mul.wide.s32    %rd6, %r4, 16;
          add.s64         %rd7, %rd4, %rd6;
          ld.global.v4.f32        {%f2, %f3, %f4, %f5}, [%rd7];
          ld.global.f32   %f10, [%rd5+20];
          cvta.to.global.u64      %rd8, %rd2;
          add.s64         %rd9, %rd8, %rd6;
          add.f32         %f11, %f3, %f5;
          add.f32         %f12, %f2, %f4;
          add.f32         %f13, %f4, %f10;
          add.f32         %f14, %f1, %f3;
          st.global.v4.f32        [%rd9], {%f14, %f12, %f11, %f13};
          ret;
  }

You can see nicely how the addresses for the vector load and the store (%rd7 and %rd9) are computed from the scaled float4 index (%rd6).

Compiling your original kernel to PTX, by contrast, results in:

  .visible .entry _Z6kernelPKfPfi(
          .param .u64 _Z6kernelPKfPfi_param_0,
          .param .u64 _Z6kernelPKfPfi_param_1,
          .param .u32 _Z6kernelPKfPfi_param_2
  )
  {
          .reg .f32       %f<11>;
          .reg .b32       %r<6>;
          .reg .b64       %rd<8>;
          ld.param.u64    %rd1, [_Z6kernelPKfPfi_param_0];
          ld.param.u64    %rd2, [_Z6kernelPKfPfi_param_1];
          cvta.to.global.u64      %rd3, %rd2;
          cvta.to.global.u64      %rd4, %rd1;
          mov.u32         %r1, %ntid.x;
          mov.u32         %r2, %ctaid.x;
          mov.u32         %r3, %tid.x;
          mad.lo.s32      %r4, %r2, %r1, %r3;
          shl.b32         %r5, %r4, 2;
          mul.wide.s32    %rd5, %r5, 4;
          add.s64         %rd6, %rd4, %rd5;
          ld.global.f32   %f1, [%rd6+-4];
          ld.global.f32   %f2, [%rd6];
          ld.global.f32   %f3, [%rd6+12];
          ld.global.f32   %f4, [%rd6+4];
          ld.global.f32   %f5, [%rd6+8];
          ld.global.f32   %f6, [%rd6+16];
          add.s64         %rd7, %rd3, %rd5;
          add.f32         %f7, %f5, %f6;
          add.f32         %f8, %f4, %f3;
          add.f32         %f9, %f2, %f5;
          add.f32         %f10, %f1, %f4;
          st.global.v4.f32        [%rd7], {%f10, %f9, %f8, %f7};
          ret;
  }

where the compiler generates code to compute only one address (%rd6) and uses static offsets from it. At this point the compiler failed to emit a vector load. Why? I honestly don't know; maybe two optimizations interfere here.

In the SASS for kernel2 you see:

        .section        .text._Z7kernel2PKfPfi,"ax",@progbits
        .sectioninfo    @"SHI_REGISTERS=18"
        .align  64
        .global         _Z7kernel2PKfPfi
        .type           _Z7kernel2PKfPfi,@function
        .size           _Z7kernel2PKfPfi,(.L_39 - _Z7kernel2PKfPfi)
        .other          _Z7kernel2PKfPfi,@"STO_CUDA_ENTRY STV_DEFAULT"
_Z7kernel2PKfPfi:
.text._Z7kernel2PKfPfi:
        /*0008*/                   MOV R1, c[0x0][0x44];
        /*0010*/                   S2R R0, SR_CTAID.X;
        /*0018*/                   MOV R4, c[0x0][0x140];
        /*0020*/                   S2R R3, SR_TID.X;
        /*0028*/                   MOV R5, c[0x0][0x144];
        /*0030*/                   IMAD R3, R0, c[0x0][0x28], R3;
        /*0038*/                   MOV32I R8, 0x10;
        /*0048*/                   IMAD R16.CC, R3, 0x10, R4;
        /*0050*/                   ISCADD R0, R3, -0x1, 0x2;
        /*0058*/                   IMAD.HI.X R17, R3, 0x10, R5;
        /*0060*/                   IMAD R14.CC, R0, 0x4, R4;
        /*0068*/                   IMAD.HI.X R15, R0, 0x4, R5;
        /*0070*/                   LD.E.128 R4, [R16];
        /*0078*/                   LD.E R2, [R14];
        /*0088*/                   IMAD R12.CC, R3, R8, c[0x0][0x148];
        /*0090*/                   LD.E R0, [R14+0x14];
        /*0098*/                   IMAD.HI.X R13, R3, R8, c[0x0][0x14c];
        /*00a0*/                   FADD R9, R4, R6;
        /*00a8*/                   FADD R10, R5, R7;
        /*00b0*/                   FADD R8, R2, R5;
        /*00b8*/                   FADD R11, R6, R0;
        /*00c8*/                   ST.E.128 [R12], R8;
        /*00d0*/                   EXIT;
.L_1:
        /*00d8*/                   BRA `(.L_1);
.L_39:

Here you have your LD.E.128.

Compiled with nvcc release 7.5, V7.5.17.

answered on Stack Overflow Sep 28, 2015 by Michael Haidl

User contributions licensed under CC BY-SA 3.0