Combining `mmap` and UVM features


Is there a function that provides these features simultaneously? I am looking for a way to allocate memory that has the traits of both memory-mapped memory (as allocated with `mmap`) and UVM memory (accessible from both the host and GPU devices). I see that `cudaHostAlloc` allocates host memory that is accessible to the devices, but there is no apparent way to declare the allocated memory range as memory-mapped!

My question is this: is there an API function to allocate memory with the above-mentioned traits?

If the answer to the above question is "no", is there a set of API functions I can call that leads to the same behavior?

For instance, could we first use `cudaMallocManaged` to allocate UVM-based memory and then use a specific API (either POSIX or CUDA) to declare the previously allocated range as memory-mapped (just like `mmap`)? Or vice versa (allocate with `mmap` and then declare the range to the CUDA driver as UVM)?

Any other suggestions will also be appreciated!


UPDATE on Dec. 13, 2018:

Unfortunately, the suggestion provided by @tera does not seem to work as expected. When the code is executed, the device does not appear to be able to see the memory on the host!

Below is the code that I am using, along with the compilation command.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <assert.h>


__global__
void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= n)
        return;
    d[index] = init;
}


void process_file(char* filename, int n) {
    if(n < 0) {
        printf("Error in n: %d\n", n);
        exit(1);
    }
    size_t filesize = n*sizeof(char);
    size_t pagesize = (size_t) sysconf (_SC_PAGESIZE);

    //Open file
    int fd = open(filename, O_RDWR|O_CREAT, 0666);
    // assert(fd != -1);
    if(fd == -1) {
        perror("Open API");
        exit(1);
    }
    if(ftruncate(fd, filesize) != 0) {
        perror("ftruncate");
        exit(1);
    }

    //Execute mmap
    char* mmappedData = (char*) mmap(0, filesize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_LOCKED, fd, 0);
    assert(mmappedData != MAP_FAILED);
    printf("mmappedData: %p\n", mmappedData);

    for(int i=0;i<n;i++)
        mmappedData[i] = 'z';

    if(cudaSuccess != cudaHostRegister(mmappedData, filesize, cudaHostRegisterDefault)) {
        printf("Unable to register with CUDA!\n");
        exit(1);
    }

    int vec = 256;
    int gang = (n) / vec + 1;
    printf("gang: %d - vec: %d\n", gang, vec);
    touchKernel<<<gang, vec>>>((char*) mmappedData, 'a', n);
    cudaDeviceSynchronize();

    //Cleanup: unregister from CUDA before unmapping
    cudaHostUnregister(mmappedData);
    int rc = munmap(mmappedData, filesize);
    assert(rc == 0);


    close(fd);
}


int main(int argc, char const *argv[])
{
    process_file("buffer.obj", 10);

    return 0;
}

And to compile, here it is:

nvcc -g -O0 f1.cu && cuda-memcheck ./a.out

cuda-memcheck generates output warning that the threads could not access the memory addresses, similar to the output below:

========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137002 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137001 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137000 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x351c13]
=========     Host Frame:./a.out [0x40a16]
=========     Host Frame:./a.out [0x6a51]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========

The above output means that the code did not execute successfully on the device.

Any suggestions?


UPDATE on Dec. 14, 2018

I changed the code to following:

__global__
void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= n || index < 0)
        return;
    printf("index %d\n", index);
    d[index] = init + (index%20);
    printf("index %d - Done\n", index);
}

If the old kernel is replaced with the one above, the output of both `printf` calls is visible, and checking the buffer.obj file shows that it contains the correct output!


SECOND UPDATE on Dec. 14, 2018

cuda-memcheck probably has some issues here. It turns out that if the executable is run without cuda-memcheck, the contents of buffer.obj are completely correct. However, if the executable is run under cuda-memcheck, the contents of the output file (buffer.obj) are completely incorrect!
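For diagnosing launch failures like the one above, a minimal error-checking wrapper around CUDA runtime calls is a common idiom (this is a sketch, not part of the original post) and makes the first failing call visible immediately:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Wrap every CUDA runtime call; print the error and abort on first failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err__ = (call);                                        \
        if (err__ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",                \
                    cudaGetErrorName(err__), __FILE__, __LINE__,           \
                    cudaGetErrorString(err__));                            \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

// Usage around the registration and kernel launch:
//   CUDA_CHECK(cudaHostRegister(mmappedData, filesize, cudaHostRegisterDefault));
//   touchKernel<<<gang, vec>>>(mmappedData, 'a', n);
//   CUDA_CHECK(cudaGetLastError());      // catches launch configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches asynchronous execution errors
```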

Tags: memory-management, cuda, mmap, memory-mapped-files, memory-mapping
asked on Stack Overflow Dec 12, 2018 by Millad • edited Dec 15, 2018 by Millad

1 Answer


Coincidentally I have just replied to a similar question on Nvidia's forum.

You can cudaHostRegister() mmapped memory if you pass the MAP_LOCKED flag to mmap().

You may need to increase the limit for locked memory (`ulimit -l` in bash, corresponding to RLIMIT_MEMLOCK) when doing so.

Update: It turns out the MAP_LOCKED flag to mmap() isn't even necessary. However, the documentation for cudaHostRegister() lists a few other limitations:

  • On systems without unified virtual addressing, the cudaHostRegisterMapped flag needs to be passed to cudaHostRegister() or the memory will not be mapped. Unless the device has a non-zero value for the cudaDevAttrCanUseHostPointerForRegisteredMem attribute, this also means you need to query the device address for the mapped memory range via cudaHostGetDevicePointer().
  • The CUDA context must have been created with the map-host flag (cudaDeviceMapHost in the runtime API, CU_CTX_MAP_HOST in the driver API) in order for mapping to be possible. Since the runtime API creates the context lazily, you would need to create the context yourself using the driver API before any invocation of the runtime API in order to affect the flags the context is created with.
answered on Stack Overflow Dec 12, 2018 by tera • edited Dec 13, 2018 by tera

User contributions licensed under CC BY-SA 3.0