ArrayFire frame search algorithm crash

I am new to ArrayFire and to CUDA development in general; I started using ArrayFire a couple of days ago after failing miserably with Thrust. I am building an ArrayFire-based algorithm that searches for a single 32x32-pixel frame in a database of a couple hundred thousand 32x32 frames stored in device memory. First I initialize a matrix whose rows hold the 1024 pixels of a frame plus one extra element for a frame group id, and whose columns index a predefined number of frames (1000 in this case).
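
As a rough sketch of that layout in plain C++ (no ArrayFire calls): it assumes column-major storage, so each frame is one contiguous column, and two 16-bit pixels packed per 32-bit element, as the code in this question does. The names `element_offset`, `database_bytes` and the `k...` constants are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values from the question: a 32x32 frame is 1024 16-bit pixels,
// packed two per 32-bit element, plus one element for the group id.
constexpr std::size_t kPixelsPerFrame = 1024;
constexpr std::size_t kWordsPerFrame  = kPixelsPerFrame / 2;  // 512
constexpr std::size_t kRowsPerFrame   = kWordsPerFrame + 1;   // 513, last row = group id

// Flat offset of element (row, frame) in a column-major matrix
// with kRowsPerFrame rows (assumption: one frame per column).
std::size_t element_offset(std::size_t row, std::size_t frame) {
    assert(row < kRowsPerFrame);
    return frame * kRowsPerFrame + row;
}

// Bytes needed to store `frames` columns of 32-bit elements.
std::size_t database_bytes(std::size_t frames) {
    return frames * kRowsPerFrame * sizeof(std::uint32_t);
}
```

With these definitions, `element_offset(kWordsPerFrame, f)` is where the group id of frame `f` would sit, and `database_bytes(1000)` gives the size of the 1000-frame test matrix.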

Here's the function that performs the search. If I uncomment the line "pixels_uint32 = device_frame_ptr[pixel_group_idx];", the program crashes. The pointer seems to be valid, so I do not understand why this happens. Is there something I am missing about accessing device memory this way?

#include <iostream>
#include <stdio.h>
#include <sys/types.h>
#include <arrayfire.h>

#include "utils.h"

using namespace af;
using namespace std;

/////////////////////////// CUDA settings ////////////////////////////////
#define TEST_DEBUG false
#define MAX_NUMBER_OF_FRAMES  1000 // maximum (2499999 frames) X (1024 + 1 pixels per frame) x (2 bytes per pixel) = 5.124.997.950 bytes (~ 5GB)
#define BLOB_FINGERPRINT_SIZE 1024 //32x32

//number of pixels (out of 1024) that must match: 768 means 75%
#define MACROBLOCK_COMPARISON_OVERALL_THRESHOLD 768 //1024 * 0.75
//////////////////////// End of CUDA settings ////////////////////////////

array search_frame(array d_db_vec) 
{
    try {
        uint number_of_uint32_for_frame = BLOB_FINGERPRINT_SIZE / 2;

        // create a 1 x MAX_NUMBER_OF_FRAMES array to hold one result per frame
        array frame_found(1,MAX_NUMBER_OF_FRAMES, u32);
        frame_found = 0;

        gfor (array frame_idx, MAX_NUMBER_OF_FRAMES) {

            // get the blob id; it is stored in the last row of the matrix
            array blob_id = d_db_vec(number_of_uint32_for_frame, frame_idx);  // addressing with (pixel_idx, frame_idx)

            // define some hardcoded pixel to search for             
            uint8_t searched_r = 0x0;
            uint8_t searched_g = 0x3F;
            uint8_t searched_b = 0x0;

            uint8_t b1 = 0;
            uint8_t g1 = 0;
            uint8_t r1 = 0;

            uint8_t b2 = 0;
            uint8_t g2 = 0;
            uint8_t r2 = 0;

            uint32_t sum1 = 0;
            uint32_t sum2 = 0;

            uint32_t *device_frame_ptr   = NULL;
            uint32_t pixels_uint32       = 0;            

            uint pixel_match_counter = 0;

            //uint pixel_match_counter = 0;
            array frame = d_db_vec(span, frame_idx);
            device_frame_ptr = frame.device<uint32_t>();

            for (uint pixel_group_idx = 0; pixel_group_idx < number_of_uint32_for_frame; pixel_group_idx++) {
                // test to see if the whole matrix is traversed 
                // d_db_vec(pixel_group_idx, frame_idx) = 0;

                /////////////////////////////// PROBLEMATIC CODE ///////////////////////////////////
                pixels_uint32 = 0x7E007E0;
                //pixels_uint32 = device_frame_ptr[pixel_group_idx]; //why does this crash the program?
                // if I uncomment the above line the program tries to copy the u32 frame into the pixels_uint32 variable
                // something goes wrong, since the pointer device_frame_ptr is not NULL and the elements should be there judging by the lines above
                ////////////////////////////////////////////////////////////////////////////////////

                // splitting the first pixel into its components
                b1 = (pixels_uint32 & 0xF8000000) >> 27;   //(input & 11111000000000000000000000000000)
                g1 = (pixels_uint32 & 0x07E00000) >> 21;   //(input & 00000111111000000000000000000000)
                r1 = (pixels_uint32 & 0x001F0000) >> 16;   //(input & 00000000000111110000000000000000)

                // splitting the second pixel into its components
                b2 = (pixels_uint32 & 0xF800) >> 11;       //(input & 00000000000000001111100000000000)
                g2 = (pixels_uint32 & 0x07E0) >> 5;        //(input & 00000000000000000000011111100000)
                r2 = (pixels_uint32 & 0x001F);             //(input & 00000000000000000000000000011111)

                // checking if they are a match
                sum1 = abs(searched_r - r1) + abs(searched_g - g1) + abs(searched_b - b1);
                sum2 = abs(searched_r - r2) + abs(searched_g - g2) + abs(searched_b - b2);

                // if they match, increment the local counter
                pixel_match_counter = (sum1 <= 16) ? pixel_match_counter + 1 : pixel_match_counter;
                pixel_match_counter = (sum2 <= 16) ? pixel_match_counter + 1 : pixel_match_counter;             
            }

            bool is_found = pixel_match_counter > MACROBLOCK_COMPARISON_OVERALL_THRESHOLD;  
            // write down if the frame is a match or not        
            frame_found(0,frame_idx) = is_found ? frame_found(0,frame_idx) : blob_id;
        }

        // test to see if the whole matrix is traversed - this has to print zeroes
        if (TEST_DEBUG)
            print(d_db_vec);

        // return the matches array
        return frame_found;

    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }
}

// make 2 green pixels
uint32_t make_test_pixel_group() {
    uint32_t b1 = 0x0;        //11111000000000000000000000000000
    uint32_t g1 = 0x7E00000;  //00000111111000000000000000000000
    uint32_t r1 = 0x0;        //00000000000111110000000000000000

    uint32_t b2 = 0x0;        //00000000000000001111100000000000
    uint32_t g2 = 0x7E0;      //00000000000000000000011111100000
    uint32_t r2 = 0x0;        //00000000000000000000000000011111

    uint32_t green_pix = b1 | g1 | r1 | b2 | g2 | r2;

    return green_pix;
}

int main(int argc, char ** argv) 
{
    info();

    /////////////////////////////////////// CREATE THE DATABASE ///////////////////////////////////////
    uint number_of_uint32_for_frame = BLOB_FINGERPRINT_SIZE / 2;

    array d_db_vec(number_of_uint32_for_frame + 1,   // fingerprint size + 1 extra u32 for blob id
                   MAX_NUMBER_OF_FRAMES,             // number of frames
                   u32);                             // type of elements is 32-bit unsigned integer (unsigned) with the configuration RGBRGB (565565)

    if (TEST_DEBUG == true) {
        for (uint frame_idx = 0; frame_idx < MAX_NUMBER_OF_FRAMES; frame_idx++) {       
            for (uint pix_idx = 0; pix_idx < number_of_uint32_for_frame; pix_idx++) {
                d_db_vec(pix_idx, frame_idx) = make_test_pixel_group();  // fill everything with green :D
            }
        }
    } else {
        d_db_vec = rand(number_of_uint32_for_frame + 1, MAX_NUMBER_OF_FRAMES);
    }

    cout << "Setting blob ids. \n\n";
    for (uint frame_idx = 0; frame_idx < MAX_NUMBER_OF_FRAMES; frame_idx++) {
        // set the blob id to 123456
        d_db_vec(number_of_uint32_for_frame, frame_idx) = 123456;  // blob_id = 123456
    }

    if (TEST_DEBUG)
        print(d_db_vec);

    cout << "Done setting blob ids. \n\n";

    //////////////////////////////////// CREATE THE SEARCHED FRAME ///////////////////////////////////

    // to be done, for now we use the hardcoded values at line 37-39 to simulate the searched pixel:
    //37        uint8_t searched_r = 0x0;
    //38        uint8_t searched_g = 0x3F;
    //39        uint8_t searched_b = 0x0;

    ///////////////////////////////////////////// SEARCH /////////////////////////////////////////////    
    clock_t timer = startTimer();    
    for (int i = 0; i< 1000; i++) {
        array frame_found = search_frame(d_db_vec);

        if (TEST_DEBUG)
            print(frame_found);
    }
    stopTimer(timer);

    return 0;
}

Here is the console output with the line commented out:

arrayfire/examples/helloworld$ ./helloworld

ArrayFire v1.9.1 (64-bit Linux, build 9af23ea)

License: Server (27000@server.accelereyes.com)

CUDA toolkit 5.0, driver 304.54

GPU0 Tesla C2075, 5376 MB, Compute 2.0

Memory Usage: 5312 MB free (5376 MB total)

Setting blob ids.

Done setting blob ids.

Time: 0.03 seconds.


Here is the console output with the line uncommented:

arrayfire/examples/helloworld$ ./helloworld

ArrayFire v1.9.1 (64-bit Linux, build 9af23ea)

License: Server (27000@server.accelereyes.com)

CUDA toolkit 5.0, driver 304.54

GPU0 Tesla C2075, 5376 MB, Compute 2.0

Memory Usage: 5312 MB free (5376 MB total)

Setting blob ids.

Done setting blob ids.

Segmentation fault


Thanks in advance for any help on this issue. I really tried everything but without success.

asked on Stack Overflow Jun 20, 2013 by DevtelSoftware • edited Jun 20, 2013 by DevtelSoftware

1 Answer

Disclaimer: I am the lead developer of ArrayFire. I see that you have posted on the AccelerEyes forums as well, but I am posting here to clear up some common issues with your code.

  1. Do not use .device(), .host(), or .scalar() inside a gfor loop. These calls cause divergence inside the GFOR loop, and GFOR was not designed for this.

  2. You cannot index into a device pointer from host code. The pointer refers to a location in GPU memory. When you write device_frame_ptr[pixel_group_idx];, the CPU tries to read that address in host memory, where it is not valid. This is the reason for your segmentation fault.

  3. Use vectorized code. For example, you don't need the inner for loop of the gfor. Instead of doing b1 = (pixels_uint32 & 0xF8000000) >> 27; inside a for loop, you can do array B1 = (frame & 0xF8000000) >> 27;. That is, instead of copying data back to the CPU and looping over it, you perform the entire operation on the GPU.

  4. Don't use if-else or ternary operators inside GFOR. These cause divergence again. Use arithmetic instead, for example pixel_match_counter = sum(sum1 <= 16) + sum(sum2 <= 16); and frame_found(0, frame_idx) = is_found * frame_found(0, frame_idx) + (1 - is_found) * blob_id;.
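
The arithmetic behind points 3 and 4 can be checked on the host before it is rewritten as array expressions. What follows is a minimal plain C++ sketch, not ArrayFire code: the names `Rgb565`, `unpack`, `is_match`, `count_matches` and `select_branch_free` are made up for this illustration, while the bit layout and the tolerance of 16 are taken from the question. It shows the per-word unpacking, the comparison-as-integer count that `sum(sum1 <= 16)` expresses, and the multiply-and-add select that replaces the ternary.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// One unpacked 5-6-5 pixel (hypothetical helper type for this sketch).
struct Rgb565 {
    int r, g, b;
};

// Unpack the pixel in the high (shift = 16) or low (shift = 0) half of a
// packed word: blue in the top 5 bits, green in the middle 6, red in the low 5.
Rgb565 unpack(std::uint32_t word, unsigned shift) {
    assert(shift == 0 || shift == 16);
    std::uint32_t p = (word >> shift) & 0xFFFFu;
    return { int(p & 0x1F), int((p >> 5) & 0x3F), int((p >> 11) & 0x1F) };
}

// 1 when the summed per-channel distance is within the tolerance, else 0.
// The comparison result is used as a number, with no if/else.
unsigned is_match(Rgb565 p, Rgb565 target, int tolerance) {
    int dist = std::abs(target.r - p.r) + std::abs(target.g - p.g)
             + std::abs(target.b - p.b);
    return static_cast<unsigned>(dist <= tolerance);
}

// Host equivalent of sum(sum1 <= 16) + sum(sum2 <= 16) for one frame.
unsigned count_matches(const std::vector<std::uint32_t>& frame,
                       Rgb565 target, int tolerance = 16) {
    unsigned count = 0;
    for (std::uint32_t word : frame) {
        count += is_match(unpack(word, 16), target, tolerance);
        count += is_match(unpack(word, 0), target, tolerance);
    }
    return count;
}

// Arithmetic select from point 4: if_true when flag is 1, if_false when flag is 0.
std::uint32_t select_branch_free(std::uint32_t flag, std::uint32_t if_true,
                                 std::uint32_t if_false) {
    assert(flag <= 1);
    return flag * if_true + (1u - flag) * if_false;
}
```

For a frame of 512 words all equal to 0x07E007E0 and a target of pure green (r = 0, g = 0x3F, b = 0), `count_matches` counts every one of the 1024 pixels. The same structure, with the loop replaced by whole-array operations, is what the vectorized version is meant to compute.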

I have answered the particular problem you are facing. If you have any further questions, please post them on our forums or send them to our support email. Stack Overflow is good for asking a specific question, but not for debugging your entire program.

answered on Stack Overflow Jun 20, 2013 by Pavan Yalamanchili

User contributions licensed under CC BY-SA 3.0