Is there any optimization possible for function to reverse the order of bytes in an int32?

1

Someone I know was recently asked this question in a coding interview. They were asked to write a function reverseBytes below such that if you pass it 0xabcdef12 it'd output 0x12efcdab.

They gave the following solution.

unsigned int reverseBytes (unsigned int x) {
  unsigned int ans = 0;
  ans = ((x & 0xff000000) >> 24) | ((x & 0x00ff0000) >> 8) |   ((x & 0x000000ff) << 24) |  ((x & 0x0000ff00) << 8);   
  return res;
}

Are there any further ways to optimize it? One can of course not do the 1st and 3rd mask, but they'd still need to do at least 2 bitmask operations, 4 bitwise-shift, and 3 bitwise-or, is that correct?

According to them, interviewer was expecting further optimizations. I'm at loss to see that there could be further optimization possible. If so, what are they?

c++
optimization
bit-manipulation
bitwise-operators
bit-shift
asked on Stack Overflow Nov 22, 2020 by Joe Black • edited Nov 22, 2020 by Joe Black

2 Answers

3

First of all, this function will not work and even compile. If I was interviewer you would have no chance to pass the test. There is a problem with operation priorities and one shift is in the wrong direction.

int is not the same as int32_t

uint32_t reverseBytes (int32_t y) {
  int32_t ans = 0;
  uint32_t x = y;
  ans = ((x & 0xff000000) >> 24) | ((x & 0x00ff0000) >> 8) |   ((x & 0x000000ff) << 24) |  ((x & 0x0000ff00) << 8);   
  return ans;
}

And any optimizing compiler will convert it a single operation if it is available on the target system

https://godbolt.org/z/5fWbGT

you can use compiler-specific builtins like uint32_t __builtin_bswap32 (uint32_t x) which is most efficient for the particular architecture and is the same efficient despite the optimization level.

answered on Stack Overflow Nov 22, 2020 by 0___________ • edited Nov 22, 2020 by 0___________
2

If you have a circular shift, start with a shift of 16.

abcdef12 -->
ef12abcd

If not, then

m = ((m << 16) & 0xffff0000) |
    ((m >> 16) & 0x0000ffff);

Then

m = ((m << 8) & 0xff00ff00) |
    ((m >> 8) & 0x00ff00ff);

This technique is barely worth doing for a mere 4 items. In fact it may be slower due to the various steps other than those being counted.

That was 2 stages, where 2 = log2(4).

But for, say, swapping all 64 bits in a 64-bit word, it is quite fast. This will take 6 stages (6 = log2(64)). That is, 3 times the work for 16 times as many things being swapped.

Another possible optimization: Note that, for example, 0xff00ff00 and 0x00ff00ff are inverses of each other. If &! is readily available in the hardware, and if loading a constant is costly, then this could be faster than the above 2nd stage:

m = ((m << 8) &   0xff00ff00) |
    ((m >> 8) & ~ 0xff00ff00);

The compiler optimizer may be able to load only the one constant and use it twice.

answered on Stack Overflow Nov 22, 2020 by Rick James • edited Nov 22, 2020 by Rick James

User contributions licensed under CC BY-SA 3.0