OpenCL result changes with arbitrary code alterations that are not related

This is a very strange issue. I'm working on a GPU-based crypto miner and I have an issue with a SHA hash function.

1 - The initial function calls a SHA256 routine and then prints the results. I'm comparing those results to a CPU based SHA256 to make sure I get the same thing.

2 - Later on in the function, there are other operations that occur, such as adding, XOR and additional SHA rounds.

As part of the miner kernel, I wrote an auxiliary function to decompose an array of 8 uints into an array of 32 unsigned chars, using an AND mask and a bit shift.

I'm calling the kernel with a global and local work size of 1.
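For reference, here is a minimal host-side C sketch of that kind of decomposition, which can be handy for checking the kernel helper's output on the CPU. The names `words_to_bytes_le` and `bytes_to_words_le` are hypothetical and not part of the miner; the byte order (least significant byte first) is an assumption taken from the kernel code shown further down.

```c
#include <stdint.h>

/* Split 8 32-bit words into 32 bytes, least significant byte first,
   using an AND mask and a shift for each byte. */
void words_to_bytes_le(uint8_t dest[32], const uint32_t src[8]) {
    for (int i = 0; i < 8; i++) {
        dest[i * 4 + 0] = (uint8_t)(src[i] & 0x000000FFu);
        dest[i * 4 + 1] = (uint8_t)((src[i] & 0x0000FF00u) >> 8);
        dest[i * 4 + 2] = (uint8_t)((src[i] & 0x00FF0000u) >> 16);
        dest[i * 4 + 3] = (uint8_t)((src[i] & 0xFF000000u) >> 24);
    }
}

/* Inverse operation, useful for a round-trip check. */
void bytes_to_words_le(uint32_t dest[8], const uint8_t src[32]) {
    for (int i = 0; i < 8; i++) {
        dest[i] = (uint32_t)src[i * 4 + 0]
                | ((uint32_t)src[i * 4 + 1] << 8)
                | ((uint32_t)src[i * 4 + 2] << 16)
                | ((uint32_t)src[i * 4 + 3] << 24);
    }
}
```

Running a buffer through both functions and comparing it with the original is a quick way to confirm that the decomposition itself is not what changes the data.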

So, here's where things get really strange. The part I am comparing is the very first SHA. I take an 80-byte input buffer, SHA it, and then print the result. It matches under certain conditions. However, if I make changes to code that executes AFTER that SHA, then it doesn't match. This is what I've been able to narrow down:

1 - If I put a printf debug in the decomposition auxiliary function, the results match. Just removing that printf causes it to mismatch.

2 - There are 4 operations I use to decompose the uint into char. I tried lots of different ways to do this with the same result. However, if I remove any 1 of the 4 "for" loops in the routine, it matches. Simply removing a for loop in code that gets executed -after- the initial code, changes the result of the initial SHA.

3 - If I change my while loop to never execute then it matches. Again, this is all -after- the initial SHA comparison.

4 - If I remove all the calls to the auxiliary function, then it matches. Simply calling the function after the initial SHA causes a mismatch.

I've tried adding memory guards everywhere, but given that there is only 1 global and 1 local work item, I don't see how that could apply.

I'd love to debug this, but apparently OpenCL cannot be debugged in VS 2019 (really?)

Any thoughts, guesses, insight would be appreciated.

Thanks!

    inline void loadUintHash ( __global unsigned char* dest, __global uint* src) {

        //**********if I remove this it doesn't work
        printf ("src1 %08x%08x%08x%08x%08x%08x%08x%08x",
                src[0],
                src[1],
                src[2],
                src[3],
                src[4],
                src[5],
                src[6],
                src[7]
                );

        //**********if I take away any one of these 4 for loops, then it works
        for ( int i = 0; i < 8; i++)
            dest[i*4+3] = (src[i] & 0xFF000000) >> 24;

        for ( int i = 0; i < 8; i++)
            dest[i*4+2] = (src[i] & 0x00FF0000) >> 16;

        for ( int i = 0; i < 8; i++)
            dest[i*4+1] = (src[i] & 0x0000FF00) >> 8;

        for ( int i = 0; i < 8; i++)
            dest[i*4] = (src[i] & 0x000000FF);

        //**********if I remove this it doesn't work
        printf ("src2 %08x%08x%08x%08x%08x%08x%08x%08x",
                src[0],
                src[1],
                src[2],
                src[3],
                src[4],
                src[5],
                src[6],
                src[7]
                );

    }

#define HASHOP_ADD 0
#define HASHOP_XOR 1
#define HASHOP_SHA_SINGLE 2
#define HASHOP_SHA_LOOP 3
#define HASHOP_MEMGEN 4
#define HASHOP_MEMADD 5
#define HASHOP_MEMXOR 6
#define HASHOP_MEM_SELECT 7
#define HASHOP_END 8

__kernel void dyn_hash (__global uint* byteCode, __global uint* memGenBuffer, int memGenSize, __global uint* hashResult, __global char* foundFlag, __global unsigned char* header, __global unsigned char* shaScratch) {
    
    int computeUnitID = get_global_id(0);

    __global uint* myMemGen = &memGenBuffer[computeUnitID * memGenSize * 8];        //each memGen unit is 256 bits, i.e. 8 uints (32 bytes)
    __global uint* myHashResult = &hashResult[computeUnitID * 8];
    __global char* myFoundFlag = foundFlag + computeUnitID;
    __global unsigned char* myHeader = header + (computeUnitID * 80);
    __global unsigned char* myScratch = shaScratch + (computeUnitID * 32);


    sha256 ( computeUnitID, 80, myHeader, myHashResult );

    //**********this is the result I am comparing
    if (computeUnitID == 0) {
       printf ("gpu first sha uint  %08x%08x%08x%08x%08x%08x%08x%08x",
                myHashResult[0],
                myHashResult[1],
                myHashResult[2],
                myHashResult[3],
                myHashResult[4],
                myHashResult[5],
                myHashResult[6],
                myHashResult[7]
                );
    }


    uint linePtr = 0;
    uint done = 0;
    uint currentMemSize = 0;
    uint instruction = 0;

//**********if I change this to done == 1, then it works
    while (done == 0) {

        if (byteCode[linePtr] == HASHOP_ADD) {
            linePtr++;
            uint arg1[8];
            for ( int i = 0; i < 8; i++)
                arg1[i] = byteCode[linePtr+i];
            linePtr += 8;
        }


        else if (byteCode[linePtr] == HASHOP_XOR) {
            linePtr++;
            uint arg1[8];
            for ( int i = 0; i < 8; i++)
                arg1[i] = byteCode[linePtr+i];
            linePtr += 8;
        }


        else if (byteCode[linePtr] == HASHOP_SHA_SINGLE) {
            linePtr++;
        }


        else if (byteCode[linePtr] == HASHOP_SHA_LOOP) {
            printf ("HASHOP_SHA_LOOP");
            linePtr++;
            uint loopCount = byteCode[linePtr];

            
            for ( int i = 0; i < loopCount; i++) {

                loadUintHash(myScratch, myHashResult);

                sha256 ( computeUnitID, 32, myScratch, myHashResult );

                if (computeUnitID == 1) {
                    loadUintHash(myScratch, myHashResult);

... more irrelevant code...
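
The `while` loop above trusts `byteCode` to be well formed and to reach `HASHOP_END`. One way to rule out a malformed program as a source of odd behaviour is to walk it on the host before uploading it. The following is only a sketch under stated assumptions: `validate_bytecode` is a hypothetical helper, the operand counts for `HASHOP_ADD`, `HASHOP_XOR` and `HASHOP_SHA_SINGLE` are read off the visible branches, the single operand for `HASHOP_SHA_LOOP` is a guess because that branch is cut off, and the memory opcodes are not shown at all, so the sketch rejects them.

```c
#include <stddef.h>
#include <stdint.h>

#define HASHOP_ADD 0
#define HASHOP_XOR 1
#define HASHOP_SHA_SINGLE 2
#define HASHOP_SHA_LOOP 3
#define HASHOP_END 8

/* Returns the index just past HASHOP_END, or -1 if the program runs off
   the end of the buffer or uses an opcode this sketch does not know. */
long validate_bytecode(const uint32_t *code, size_t len) {
    size_t pc = 0;
    while (pc < len) {
        size_t operands;
        switch (code[pc]) {
        case HASHOP_ADD:
        case HASHOP_XOR:        operands = 8; break;  /* eight uint arguments */
        case HASHOP_SHA_SINGLE: operands = 0; break;
        case HASHOP_SHA_LOOP:   operands = 1; break;  /* assumption: loop count only */
        case HASHOP_END:        return (long)(pc + 1);
        default:                return -1;            /* opcode not covered here */
        }
        if (operands > len - pc - 1)
            return -1;                                /* operands would overrun */
        pc += 1 + operands;
    }
    return -1;                                        /* never reached HASHOP_END */
}
```

A result of -1 for the program being uploaded would suggest looking at the bytecode itself before suspecting the kernel.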


This is how the kernel is being called:

    size_t globalWorkSize = 1;// computeUnits;
    size_t localWorkSize = 1;
    returnVal = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);
c++
opencl
asked on Stack Overflow May 14, 2021 by Shaun Neal • edited May 14, 2021 by Shaun Neal

1 Answer

The issue ended up being multiple things:

1 - The CPU SHA had a bug in it that was causing an incorrect result in some cases.

2 - There was a very strange syntax error which seems to have broken the compiler in a weird way:

    void otherproc () { ...do stuff... }

    if (something) {/ ...other code }

That forward slash after the opening curly brace was messing up "otherproc" in a weird way, and the compiler did not throw an error. After staring at the code line by line I found that slash, removed it, and everything started working.
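
Since part of the problem was the CPU reference itself, it can help to check both the CPU and the GPU digest against a published known-answer vector instead of only against each other. Here is a minimal C sketch, assuming digests are held as eight `uint` words as in the kernel; `SHA256_ABC`, `first_mismatch` and `format_digest` are hypothetical names, and the vector is the widely published SHA-256 digest of the three-byte message "abc".

```c
#include <stdint.h>
#include <stdio.h>

/* SHA-256("abc") as eight 32-bit words. */
static const uint32_t SHA256_ABC[8] = {
    0xba7816bfu, 0x8f01cfeau, 0x414140deu, 0x5dae2223u,
    0xb00361a3u, 0x96177a9cu, 0xb410ff61u, 0xf20015adu
};

/* Returns the index of the first differing word, or -1 if all 8 match. */
int first_mismatch(const uint32_t got[8], const uint32_t expected[8]) {
    for (int i = 0; i < 8; i++) {
        if (got[i] != expected[i])
            return i;
    }
    return -1;
}

/* Writes the digest as 64 hex characters plus a terminating NUL,
   in the same %08x-per-word layout the kernel's printf uses. */
void format_digest(char out[65], const uint32_t d[8]) {
    for (int i = 0; i < 8; i++)
        snprintf(out + i * 8, 9, "%08x", (unsigned int)d[i]);
}
```

Hashing "abc" with each implementation and passing the result to `first_mismatch` along with `SHA256_ABC` shows which side is wrong, rather than just that the two sides disagree.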

If anyone is interested, the working implementation of a GPU miner can be found here:

https://github.com/dynamofoundation/dyn_miner

answered on Stack Overflow May 16, 2021 by Shaun Neal

User contributions licensed under CC BY-SA 3.0