There is a set of rules, where each rule corresponds to ceratain cellular automaton. I need to check property of bijectivity for each of these rules. As there are too much of them (2^32 to be precise), I decided to use my GPU for this purpose. But after week or so I am still struggling with one bug.
Briefly speaking, when the kernel is enqueued and its execution is supposedly being performed on GPU, the usage of GPU is as if it is idle. Furthermore, after I added several statements to kernel code in order to see if kernel is being executed at all, I found no signs of that the statements and therefore kernel itself were executed. Besides, all error codes are equal to CL_SUCCESS. I might get something wrong as I'm new to OpenCL programming and will apreciate any help.
This is the host side code with some abbreviations:
#define CL_USE_DEPRECATED_OPENCL_2_0_APIS
//some includes here
#define GLOBAL_SIZE 4096
#define LOCAL_SIZE 256
#define GLOBAL_SCOPE 0xffffffff
int main()
{
//we assume that global_scope divides evenly into global_size
//and therefore there is no need in processing remainder
long rules_per_thread = GLOBAL_SCOPE / GLOBAL_SIZE;
int * starts = new int[GLOBAL_SIZE];
int * stops = new int[GLOBAL_SIZE];
int count = 0;
for (int i = 0; i < GLOBAL_SIZE; i++) {
starts[i] = count;
count += rules_per_thread;
stops[i] = count;
count++;
}
...
/*obtainig platform, device, building program*/
...
/*====CREATING BUFFERS====*/
//buffer for storing approved automata
const int bufSize = 10000; //size of buffer picked at random guess; might need to add some space later
uint32_t* bijective_aut = new uint32_t[bufSize];
std::fill(&bijective_aut[0], &bijective_aut[bufSize - 1], 0);
//first value in array serves as global iterator over array
//and initially is set to base offset
bijective_aut[0] = 3;
//second value serves as indicator of array length
bijective_aut[1] = bufSize;
cl::Buffer buf(context, CL_MEM_READ_WRITE, sizeof(uint32_t) * bufSize);
cl::Buffer starts_buf(context, CL_MEM_READ_ONLY, sizeof(int) * GLOBAL_SIZE);
cl::Buffer stops_buf(context, CL_MEM_READ_ONLY, sizeof(int) * GLOBAL_SIZE);
/*====SETTING UP COMMAND QUEUE====*/
cl::CommandQueue queue(context, device);
err = queue.enqueueWriteBuffer(buf, CL_FALSE, 0, sizeof(uint32_t) * bufSize, bijective_aut);
err = queue.enqueueWriteBuffer(starts_buf, CL_FALSE, 0, sizeof(int) * GLOBAL_SIZE, starts);
err = queue.enqueueWriteBuffer(stops_buf, CL_FALSE, 0, sizeof(int) * GLOBAL_SIZE, stops);
/*====CREATING KERNEL, SETTING ITS VARIABLES====*/
cl::Kernel bc_kernel(program, "bijection_check", &err);
err = bc_kernel.setArg(0, buf);
err = bc_kernel.setArg(1, starts_buf);
err = bc_kernel.setArg(2, stops_buf);
/*====EXECUTING KERNEL====*/
cl::Event event;
err = queue.enqueueNDRangeKernel(bc_kernel, cl::NullRange, cl::NDRange(GLOBAL_SIZE), cl::NDRange(LOCAL_SIZE), nullptr, &event);
event.wait();
err = queue.enqueueReadBuffer(buf, CL_FALSE, 0, sizeof(uint32_t) * bufSize, bijective_aut);
cl::finish();
}
Then there is the kernel code:
__kernel void bijection_check (
__global uint * bijective_rules, //stores approved bijective rules
__global const uint * starts,
__global const uint * stops
)
{
__private int idx = get_global_id(0);
int iterator = bijective_rules[0]; //inditates next free cell to write in
int start = starts[idx];
int stop = stops[idx];
bool check = true;
//there is some variables required for test
//iterating over rules between *start* and *stop*
for (uint rule = start; rule < stop; rule++)
{
...
/*then there goes test of the rule for bijectivity*/
...
//writing current rule to general list if it turned to be bijective
if ((check == true) && (iterator < 10000))
{
bijective_rules[iterator] = rule;
bijective_rules[0]++;
}
else
{
bijective_rules[2]++;
}
}
bijective_rules[3]++;
}
Judging by the array read from the buffer after execution, both statements at the end were not performed a single time. That is, after kernel execution bijective_rules array is left in exactly same condition as it was previosly defined on the host side.
You have a race condition: You read bijective_rules[0];
, but other threads at the same time might execute bijective_rules[0]++;
, thereby reading and writing to that memory location. If two threads write different data to the same memory address, you have a race condition and it is random which of the two gets to decide the result. So your result will be random and non-reproducible.
If multiple threads need to increment a value in the same memory location, use the atoimic function atomic_inc
. Atomic functions block the memory location while one thread is working on it, and all other threads have to wait.
To get rid of the race condition, read from one copy of the buffer (or one particular memory address) and write to a second copy (or address). This way, you never write to the memory that other concurrent threads are reading from.
User contributions licensed under CC BY-SA 3.0