OpenCL kernel doesn't get to be executed


There is a set of rules, where each rule corresponds to ceratain cellular automaton. I need to check property of bijectivity for each of these rules. As there are too much of them (2^32 to be precise), I decided to use my GPU for this purpose. But after week or so I am still struggling with one bug.
Briefly speaking, when the kernel is enqueued and its execution is supposedly being performed on GPU, the usage of GPU is as if it is idle. Furthermore, after I added several statements to kernel code in order to see if kernel is being executed at all, I found no signs of that the statements and therefore kernel itself were executed. Besides, all error codes are equal to CL_SUCCESS. I might get something wrong as I'm new to OpenCL programming and will apreciate any help.
This is the host side code with some abbreviations:


//some includes here

#define GLOBAL_SIZE 4096
#define LOCAL_SIZE 256
#define GLOBAL_SCOPE 0xffffffff
int main()
    //we assume that global_scope divides evenly into global_size
    //and therefore there is no need in processing remainder
    long rules_per_thread = GLOBAL_SCOPE / GLOBAL_SIZE;
    int * starts = new int[GLOBAL_SIZE];
    int * stops = new int[GLOBAL_SIZE];

    int count = 0;
    for (int i = 0; i < GLOBAL_SIZE; i++) {
        starts[i] = count;
        count += rules_per_thread;
        stops[i] = count;
    /*obtainig platform, device, building program*/

    /*====CREATING BUFFERS====*/
    //buffer for storing approved automata
    const int bufSize = 10000;  //size of buffer picked at random guess; might need to add some space later
    uint32_t* bijective_aut = new uint32_t[bufSize];
    std::fill(&bijective_aut[0], &bijective_aut[bufSize - 1], 0);
    //first value in array serves as global iterator over array
    //and initially is set to base offset
    bijective_aut[0] = 3;
    //second value serves as indicator of array length
    bijective_aut[1] = bufSize;

    cl::Buffer buf(context, CL_MEM_READ_WRITE, sizeof(uint32_t) * bufSize);
    cl::Buffer starts_buf(context, CL_MEM_READ_ONLY, sizeof(int) * GLOBAL_SIZE);
    cl::Buffer stops_buf(context, CL_MEM_READ_ONLY, sizeof(int) * GLOBAL_SIZE);

    cl::CommandQueue queue(context, device);

    err = queue.enqueueWriteBuffer(buf, CL_FALSE, 0, sizeof(uint32_t) * bufSize, bijective_aut);
    err = queue.enqueueWriteBuffer(starts_buf, CL_FALSE, 0, sizeof(int) * GLOBAL_SIZE, starts);
    err = queue.enqueueWriteBuffer(stops_buf, CL_FALSE, 0, sizeof(int) * GLOBAL_SIZE, stops);

    cl::Kernel bc_kernel(program, "bijection_check", &err);
    err = bc_kernel.setArg(0, buf);
    err = bc_kernel.setArg(1, starts_buf);
    err = bc_kernel.setArg(2, stops_buf);

    /*====EXECUTING KERNEL====*/
    cl::Event event;
    err = queue.enqueueNDRangeKernel(bc_kernel, cl::NullRange, cl::NDRange(GLOBAL_SIZE), cl::NDRange(LOCAL_SIZE), nullptr, &event);
    err = queue.enqueueReadBuffer(buf, CL_FALSE, 0, sizeof(uint32_t) * bufSize, bijective_aut);


Then there is the kernel code:

 __kernel void bijection_check (
                                __global uint * bijective_rules, //stores approved bijective rules
                                __global const uint * starts, 
                                __global const uint * stops 
    __private int idx = get_global_id(0);
    int iterator = bijective_rules[0];  //inditates next free cell to write in
    int start = starts[idx];
    int stop = stops[idx];
    bool check = true;
    //there is some variables required for test

    //iterating over rules between *start* and *stop*
    for (uint rule = start; rule < stop; rule++)
        /*then there goes test of the rule for bijectivity*/

        //writing current rule to general list if it turned to be bijective
        if ((check == true) && (iterator < 10000))
            bijective_rules[iterator] = rule;

Judging by the array read from the buffer after execution, both statements at the end were not performed a single time. That is, after kernel execution bijective_rules array is left in exactly same condition as it was previosly defined on the host side.

asked on Stack Overflow Apr 25, 2021 by absqua • edited Apr 25, 2021 by absqua

1 Answer


You have a race condition: You read bijective_rules[0];, but other threads at the same time might execute bijective_rules[0]++;, thereby reading and writing to that memory location. If two threads write different data to the same memory address, you have a race condition and it is random which of the two gets to decide the result. So your result will be random and non-reproducible.

If multiple threads need to increment a value in the same memory location, use the atoimic function atomic_inc. Atomic functions block the memory location while one thread is working on it, and all other threads have to wait.

To get rid of the race condition, read from one copy of the buffer (or one particular memory address) and write to a second copy (or address). This way, you never write to the memory that other concurrent threads are reading from.

answered on Stack Overflow Apr 25, 2021 by ProjectPhysX

User contributions licensed under CC BY-SA 3.0