Optimizing global memory load in CUDA


My task: I have two matrices, A (18 x 4194304) and B (18 x 1024).

I have to take each 18-length vector from A, compute its distance to every 18-length vector from B, and find the minimum distance and its index.

My code:

__device__
void GetMin(float &dist, int &idx)
{
    // Warp-level min-reduction: after the five shuffle steps below,
    // lane 0 of the warp holds the smallest distance and its index.
    for (int offset = 16; offset > 0; offset >>= 1)
    {
        float dist2 = __shfl_down_sync(0xFFFFFFFF, dist, offset);
        int   idx2  = __shfl_down_sync(0xFFFFFFFF, idx,  offset);
        if (dist > dist2)
        {
            dist = dist2;
            idx  = idx2;
        }
    }
}

__global__
void CalcMinDist_kernel(const float *A, const float *B, float *output, const int nNumPixels, int nNumImages)
{
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    int ty = threadIdx.y;

    int lane_id = tx % 32;

    float dist = 0;
    int idx = 0;

    float fMin = 99999999;
    int nMinIdx = -1;

    // Each lane handles a 32-strided subset of the 1024 B vectors.
    for(int i = lane_id; i < 1024; i += 32)
    {
        // Squared Euclidean distance over the nNumImages (= 18) components.
        dist = 0;
        for(int j = 0; j < nNumImages; ++j)
        {
            int img_idx = blockIdx.x * ty + j * nNumPixels;
            dist += (A[img_idx] - B[i * nNumImages + j]) * 
                    (A[img_idx] - B[i * nNumImages + j]);
        }
        idx = i;
        // Warp reduction: lane 0 of the warp receives the minimum of this pass.
        GetMin(dist, idx);

        if(threadIdx.x == 0)
        {
            if(fMin > dist)
            {
                fMin = dist;
                nMinIdx = idx;
            }
        }
    }

    // Thread 0 of the block writes the winning index.
    if(threadIdx.x == 0)
    {
        output[blockIdx.x * ty] = nMinIdx;
    }
}

Looking at the profiler, I can see that I'm memory bound, with ~90% occupancy. Is there any way to speed up this operation?

Let me know if I need to provide any other information.

parallel-processing
cuda
asked on Stack Overflow Feb 24, 2020 by Illuminati0x5B • edited Feb 24, 2020 by Illuminati0x5B

1 Answer


Actually, I would look at the algorithm first. This is a geometric problem - treat it as such.

You should represent the B data using a different data structure, e.g. by clustering it or building a partitioning structure (such as a k-d tree). That will let you avoid actually computing the distance from most B elements. (You could also consider a projection onto fewer dimensions, but the benefit of this may be more elusive.)
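As a rough illustration of the clustering idea (a sketch only, not a drop-in replacement for the kernel above): if the 1024 B vectors are grouped offline, say by k-means, each query can first pick the nearest centroid and then scan only that cluster's members. All names here (centroids, clusterStart, clusterMembers, nClusters) are hypothetical:

__device__
int FindNearestClustered(const float *a,              // one 18-length query vector
                         const float *centroids,      // nClusters x nDims, one row per centroid
                         const int   *clusterStart,   // CSR-style offsets, length nClusters + 1
                         const int   *clusterMembers, // B indices grouped by cluster
                         const float *B, int nClusters, int nDims)
{
    // Phase 1: nearest centroid - nClusters distances instead of 1024.
    int   best_c  = 0;
    float best_cd = 1e30f;
    for (int c = 0; c < nClusters; ++c)
    {
        float d = 0;
        for (int j = 0; j < nDims; ++j)
        {
            float t = a[j] - centroids[c * nDims + j];
            d += t * t;
        }
        if (d < best_cd) { best_cd = d; best_c = c; }
    }

    // Phase 2: exact distances only within the winning cluster.
    int   best_i = -1;
    float best_d = 1e30f;
    for (int m = clusterStart[best_c]; m < clusterStart[best_c + 1]; ++m)
    {
        int i = clusterMembers[m];
        float d = 0;
        for (int j = 0; j < nDims; ++j)
        {
            float t = a[j] - B[i * nDims + j];
            d += t * t;
        }
        if (d < best_d) { best_d = d; best_i = i; }
    }
    return best_i; // approximate: the true nearest may sit in an adjacent cluster
}

Scanning the two or three nearest clusters instead of one trades back some of the saving for better accuracy.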


With respect to the access pattern: you would probably benefit from having consecutive threads work on consecutive elements of the 18-element-long vectors, rather than having each thread work on a complete 18-element-long vector by itself. That would better fit the memory layout - right now, a warp read is of many elements which are at a distance of 18 from each other. If I understand the code correctly, anyway.
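Here is a sketch of one possible arrangement (assuming A is stored as 18 rows of nNumPixels floats, with a hypothetical TILE parameter - not necessarily the exact layout the question uses): stage B through shared memory with coalesced loads, and give each thread one A vector, so that for a fixed component j a warp reads 32 consecutive floats of A:

#define TILE 256  // B vectors staged per pass; all of B (72 KB) won't fit at once

__global__
void CalcMinDist_tiled(const float *A, const float *B, float *output,
                       int nNumPixels, int nNumImages /* assumed 18 */)
{
    __shared__ float sB[TILE * 18];

    // One A vector per thread; assumes gridDim.x * blockDim.x == nNumPixels.
    int pixel = threadIdx.x + blockIdx.x * blockDim.x;

    float fMin    = 1e30f;
    int   nMinIdx = -1;

    for (int base = 0; base < 1024; base += TILE)
    {
        // Coalesced staging: consecutive threads read consecutive floats of B.
        for (int k = threadIdx.x; k < TILE * nNumImages; k += blockDim.x)
            sB[k] = B[base * nNumImages + k];
        __syncthreads();

        for (int i = 0; i < TILE; ++i)
        {
            float dist = 0;
            for (int j = 0; j < nNumImages; ++j)
            {
                // For fixed j, consecutive threads hit consecutive addresses
                // of A; the sB read is a conflict-free broadcast.
                float t = A[j * nNumPixels + pixel] - sB[i * nNumImages + j];
                dist += t * t;
            }
            if (dist < fMin) { fMin = dist; nMinIdx = base + i; }
        }
        __syncthreads();
    }

    output[pixel] = nMinIdx;
}

Note that with one A vector per thread, the warp reduction disappears entirely; each thread keeps its own running minimum.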

(I also think GetMin() could avoid some of the index swaps, but that's not significant, since you perform very few of those.)
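For what it's worth, a hedged sketch of one way to trim it: since the distances are non-negative sums of squares, the IEEE-754 bit pattern of the float orders the same way as the value itself, so distance and index can be packed into a single 64-bit word and reduced with one shuffle per step instead of two (the name GetMinPacked is mine, not from the question):

__device__
void GetMinPacked(float &dist, int &idx)
{
    // Pack: non-negative float bits in the high word preserve ordering,
    // index in the low word; ties then resolve to the smaller index.
    unsigned long long packed =
        ((unsigned long long)__float_as_uint(dist) << 32) | (unsigned int)idx;

    for (int offset = 16; offset > 0; offset >>= 1)
    {
        unsigned long long other = __shfl_down_sync(0xFFFFFFFF, packed, offset);
        if (other < packed)
            packed = other;
    }

    dist = __uint_as_float((unsigned int)(packed >> 32));
    idx  = (int)(packed & 0xFFFFFFFFu);
}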

answered on Stack Overflow Feb 24, 2020 by einpoklum • edited Feb 24, 2020 by einpoklum

User contributions licensed under CC BY-SA 3.0