C++ binary files not read correctly

I am reading a big-endian binary file on a little-endian Intel processor in C++. I have tried reading it with both std::ifstream::open() and fopen(), and both give the same wrong result. The file is the training-images file from the MNIST dataset. It starts with four header fields, each 32 bits wide and stored in big-endian order. My code works, except that it gives the wrong value for the second header field; the other three are read correctly. I opened the file in a hex editor to check whether the stored value might be wrong, but it is correct. Here is the code that reads the headers:

void DataHandler::readInputData(std::string path){
    uint32_t headers[4];
    char bytes[4];
    std::ifstream file;
    //I tried both open() and fopen() as seen below
    file.open(path.c_str(), std::ios::binary | std::ios::in);
    //FILE* f = fopen(path.c_str(), "rb");
    if (file)
    {
        int i = 0;
        while (i < 4)//4 headers
        {
            //if (fread(bytes, sizeof(bytes), 1, f))
            //{
            //    headers[i] = format(bytes);
            //    ++i;
            //}
            file.read(bytes, sizeof(bytes));
            headers[i++] = format(bytes);
        }
        printf("Done getting images file header.\n");
        printf("magic: 0x%08x\n", headers[0]);
        printf("nImages: 0x%08x\n", headers[1]);//THIS IS THE ONE THAT IS GETTING READ WRONG
        printf("rows: 0x%08x\n", headers[2]);
        printf("cols: 0x%08x\n", headers[3]);
        exit(1);
        //reading rest of the file code here
    }
    else
    {
        printf("Invalid Input File Path\n");
        exit(1);
    }
}

//converts big-endian to little-endian (required for Intel processors)
uint32_t DataHandler::format(const char * bytes) const
{
    return (uint32_t)((bytes[0] << 24) |
        (bytes[1] << 16) |
        (bytes[2] << 8) |
        (bytes[3]));
}

The output I am getting is:

Done getting images file header.
magic: 0x00000803
nImages: 0xffffea60
rows: 0x0000001c
cols: 0x0000001c

nImages should be 60,000 (0x0000ea60), but it is being read as 0xffffea60. In a hex editor, the second 32-bit value in the file is 00 00 ea 60, so the file itself is correct and the value is being read wrong.

c++
io
endianness
asked on Stack Overflow Aug 27, 2020 by Ak01

1 Answer

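It seems that char is signed in your environment, so the byte 0xEA in the data is sign-extended to 0xFFFFFFEA when it is promoted to int. After the shift by 8 it becomes 0xFFFFEA00, and the bitwise OR then sets the upper 16 bits of the result. The other headers are unaffected because all of their bytes are below 0x80.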
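
To make the arithmetic concrete, here is a small compile-time sketch. The helper names promoted and masked and the constant kEA are made up for illustration, and std::int8_t stands in for a signed char. The shift is done on an unsigned value here so the result is well defined.

```cpp
#include <cstdint>

// Widen a signed 8-bit value the way integer promotion does, then view the
// result as 32 unsigned bits. Negative inputs get their upper bits set to 1.
constexpr std::uint32_t promoted(std::int8_t b)
{
    return static_cast<std::uint32_t>(static_cast<std::int32_t>(b));
}

// Same widening, but keeping only the low 8 bits, which is the value an
// unsigned byte would have produced.
constexpr std::uint32_t masked(std::int8_t b)
{
    return promoted(b) & 0xFFu;
}

// The second header is stored as 00 00 EA 60.
// Interpreted as a signed 8-bit value, 0xEA is -22.
constexpr std::int8_t kEA = -22;

static_assert(promoted(kEA) == 0xFFFFFFEAu, "0xEA is sign-extended");
static_assert(((promoted(kEA) << 8) | 0x60u) == 0xFFFFEA60u,
              "reproduces the wrong nImages value");
static_assert(((masked(kEA) << 8) | 0x60u) == 0x0000EA60u,
              "without sign extension the value is 60,000");
```

If these static_asserts compile, the numbers match the output in the question: 0xffffea60 with a signed byte, 0x0000ea60 without sign extension.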

To prevent this, use unsigned char instead of char, both for the element type of bytes and for the parameter of format().
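
A minimal sketch of that change, assuming free functions rather than the DataHandler members from the question. The names readBigEndian32, readHeaderField, kMagicBytes and kCountBytes are hypothetical. Because std::istream::read takes a char*, the unsigned buffer is passed through a reinterpret_cast.

```cpp
#include <cstdint>
#include <istream>

// Decode four big-endian bytes into a host-order 32-bit value.
// Each byte is unsigned, so widening it cannot set any upper bits.
constexpr std::uint32_t readBigEndian32(const unsigned char* bytes)
{
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
            static_cast<std::uint32_t>(bytes[3]);
}

// Read one 32-bit header field from a binary stream.
// Returns false if four bytes could not be read.
inline bool readHeaderField(std::istream& in, std::uint32_t& out)
{
    unsigned char bytes[4];
    if (!in.read(reinterpret_cast<char*>(bytes), sizeof(bytes)))
        return false;
    out = readBigEndian32(bytes);
    return true;
}

// The first two header fields as they appear in the question's output.
constexpr unsigned char kMagicBytes[4] = {0x00, 0x00, 0x08, 0x03};
constexpr unsigned char kCountBytes[4] = {0x00, 0x00, 0xEA, 0x60};

static_assert(readBigEndian32(kMagicBytes) == 0x00000803u, "magic");
static_assert(readBigEndian32(kCountBytes) == 60000u, "nImages");
```

Casting each byte to std::uint32_t before shifting also keeps the shift by 24 in unsigned arithmetic, so a first byte of 0x80 or above cannot overflow a signed int.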

answered on Stack Overflow Aug 27, 2020 by MikeCAT

User contributions licensed under CC BY-SA 3.0