Branch Prediction at no cost?


I've just stumbled upon this idea, and I'm really curious whether modern CPUs (current desktop ones, and maybe mobile/embedded ones as well) might actually have no branching cost in the situation below.

1. Let's say we have this:

x += a; // assume x and a were declared earlier as plain ints
if (flag)
   do A  // assume doing A is different from doing B
else
   do B

2. Compared to this:

if (flag)
{
   x += a;
   do A
}
else
{
   x += a;
   do B
}

Assuming A and B are completely different in terms of pipeline instructions (fetch, decode, execute, etc.):

  1. Is the second approach going to be faster?

  2. Are CPUs smart enough to tell that, no matter what the flag is, the next instruction is the same (so they won't have to discard pipeline stages for it because of a branch misprediction)?

Note:

In the first case the CPU has no option but to discard the first few pipeline stages of do A or do B when a branch misprediction happens, because they are different. I see the second example as a kind of delayed branching: "I'm going to check that flag, but even before I know its value I can get on with the next instruction, because it's the same no matter what the flag is; I already have the next instruction and it's safe to use it."

EDIT:
I did some research and I have some interesting results. How would you explain this behavior? Sorry for my latest edit, but as far as I could tell I had some cache problems; these are more accurate results and code samples, I hope.

Here is the code, compiled with gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) using -O3.

Case 1.

#include <stdio.h>

extern int * cache;
extern bool * b;
extern int * x;
extern int * a;
extern unsigned long * loop;

extern void A();
extern void B();

int main()
{
    for (unsigned long i = 0; i < *loop; ++i)
    {
        ++*cache;

        *x += *a;

        if (*b)
        {
            A();
        }
        else
        {
            B();
        }
    }

    delete b;
    delete x;
    delete a;
    delete loop;
    delete cache;

    return 0;
}

int * cache = new int(0);
bool * b = new bool(true);
int * x = new int(0);
int * a = new int(0);
unsigned long * loop = new unsigned long(0x0ffffffe);

void A() { --*x; *b = false; }
void B() { ++*x; *b = true; }

Case 2.

#include <stdio.h>

extern int * cache;
extern bool * b;
extern int * x;
extern int * a;
extern unsigned long * loop;

extern void A();
extern void B();

int main()
{
    for (unsigned long i = 0; i < *loop; ++i)
    {
        ++*cache;

        if (*b)
        {
            *x += *a;
            A();
        }
        else
        {
            *x += *a;
            B();
        }
    }

    delete b;
    delete x;
    delete a;
    delete loop;
    delete cache;

    return 0;
}

int * cache = new int(0);
bool * b = new bool(true);
int * x = new int(0);
int * a = new int(0);
unsigned long * loop = new unsigned long(0x0ffffffe);

void A() { --*x; *b = false; }
void B() { ++*x; *b = true; }

There is practically no noticeable difference between the -O3 versions of the two approaches, but without -O3 the second case does run slightly faster, at least on my machine. I tested without -O3 and with loop = 0xfffffffe.
Best times:
alin@ubuntu:~/Desktop$ time ./1

real 0m20.231s
user 0m20.224s
sys 0m0.020s

alin@ubuntu:~/Desktop$ time ./2

real 0m19.932s
user 0m19.890s
sys 0m0.060s
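
A way to go beyond wall-clock time (my own suggestion, not part of the original measurements) is to read the hardware performance counters: on Linux, perf can count executed branches and mispredictions directly. Note that b here simply alternates every iteration, a pattern modern branch predictors handle almost perfectly, so both versions should show a near-zero miss rate:

perf stat -e branches,branch-misses ./1
perf stat -e branches,branch-misses ./2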

Tags: c++, c, pipeline, branch-prediction
asked on Stack Overflow Sep 27, 2015 by Alin Ionut Lipan • edited Sep 27, 2015 by Alin Ionut Lipan

2 Answers


There are two parts to this:

First, does the compiler optimize this?

Let's run an experiment:

test.cc

#include <random>
#include "test2.h"

int main() {
  std::default_random_engine e;
  std::uniform_int_distribution<int> d(0,1);
  int flag = d(e);

  int x = 0;
  int a = 1;

  if (flag) {
    x += a;
    doA(x);
    return x;
  } else {
    x += a;
    doB(x);
    return x;
  }
}

test2.h

void doA(int& x);
void doB(int& x);

test2.cc

void doA(int& x) {}
void doB(int& x) {}

test2.cc and test2.h exist solely to prevent the compiler from optimizing everything away. Because these functions live in another translation unit, the compiler cannot be certain they have no side effects.
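
As an aside, and purely my own variation on the same trick: if you want to keep everything in one translation unit, an empty GCC-style inline asm statement can act as a similar optimization barrier, because the compiler must assume it reads and rewrites its operand. The caveat is that the compiler can now see that both bodies are identical and may merge the two branches, so the separate translation unit remains the more faithful setup for this experiment.

// Hypothetical single-file stand-ins for doA/doB: the empty asm forces the
// compiler to materialize x and keeps the calls from being proven side-effect-free.
void doA(int& x) { asm volatile("" : "+r"(x)); }
void doB(int& x) { asm volatile("" : "+r"(x)); }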

Now we compile to assembly:

gcc -std=c++11 -S test.cc

And let's jump to the part of the assembly that's interesting:

  call  _ZNSt24uniform_int_distributionIiEclISt26linear_congruential_engineImLm16807ELm0ELm2147483647EEEEiRT_
  movl  %eax, -40(%rbp); <- setting flag
  movl  $0, -44(%rbp);   <- setting x
  movl  $1, -36(%rbp);   <- setting a
  cmpl  $0, -40(%rbp);   <- first part of if (flag)
  je    .L2;             <- second part of if (flag)
  movl  -44(%rbp), %edx  <- setting up x
  movl  -36(%rbp), %eax  <- setting up a
  addl  %edx, %eax       <- adding x and a
  movl  %eax, -44(%rbp)  <- assigning back to x
  leaq  -44(%rbp), %rax  <- grabbing address of x
  movq  %rax, %rdi       <- bookkeeping for function call
  call  _Z3doARi         <- function call doA
  movl  -44(%rbp), %eax
  jmp   .L4
.L2:
  movl  -44(%rbp), %edx  <- setting up x
  movl  -36(%rbp), %eax  <- setting up a
  addl  %edx, %eax       <- perform the addition
  movl  %eax, -44(%rbp)  <- move it back to x
  leaq  -44(%rbp), %rax  <- and so on
  movq  %rax, %rdi
  call  _Z3doBRi
  movl  -44(%rbp), %eax
.L4:

So we can see that the compiler didn't optimize it. But we also didn't actually ask it to.

g++ -std=c++11 -S -O3 test.cc

and then the interesting assembly:

main:
.LFB4729:
  .cfi_startproc
  subq  $56, %rsp
  .cfi_def_cfa_offset 64
  leaq  32(%rsp), %rdx
  leaq  16(%rsp), %rsi
  movq  $1, 16(%rsp)
  movq  %fs:40, %rax
  movq  %rax, 40(%rsp)
  xorl  %eax, %eax
  movq  %rdx, %rdi
  movl  $0, 32(%rsp)
  movl  $1, 36(%rsp)
  call  _ZNSt24uniform_int_distributionIiEclISt26linear_congruential_engineImLm16807ELm0ELm2147483647EEEEiRT_RKNS0_10param_typeE
  testl %eax, %eax
  movl  $1, 12(%rsp)
  leaq  12(%rsp), %rdi
  jne   .L83
  call  _Z3doBRi
  movl  12(%rsp), %eax
.L80:
  movq  40(%rsp), %rcx
  xorq  %fs:40, %rcx
  jne   .L84
  addq  $56, %rsp
  .cfi_remember_state
  .cfi_def_cfa_offset 8
  ret
.L83:
  .cfi_restore_state
  call  _Z3doARi
  movl  12(%rsp), %eax
  jmp   .L80

This is a bit beyond my ability to cleanly show a one-to-one relationship between the assembly and the code, but you can tell from the calls to doA and doB that the setup is all common and done outside of the if statement (above the jne .L83 line). So yes, compilers do perform this optimization.
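
If you want to repeat this check quickly, something along these lines (assuming g++ is on your path) dumps the optimized assembly straight to stdout and filters for the branch and the two calls:

g++ -std=c++11 -O3 -S -o - test.cc | grep -E "doA|doB|jne"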

Part 2:

How can we know if CPUs do this optimization if given the first code?

I'm actually not aware of a way to test this directly, so I do not know. I'd rate it as plausible, given that out-of-order and speculative execution exist. But the proof is in the pudding, and I don't have a way of testing this pudding, so I'm reluctant to make a claim one way or the other.
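
What you can measure, though, is whether branch mispredictions cost anything at all on your machine. Here is a minimal sketch (my own, not a test of the exact hypothesis above): it times the same branchy loop over random flags (roughly 50% mispredicted) and over alternating flags (trivially predicted). Inspect the generated assembly before trusting the numbers, since the optimizer may replace the branch with a branchless cmov.

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Times one pass over flags; the branch direction depends on flags[i].
long long run(const std::vector<char>& flags) {
    volatile long long x = 0;  // volatile keeps the loop from being optimized out
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < flags.size(); ++i) {
        if (flags[i]) x = x + 1; else x = x - 1;  // the branch under test
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

int main() {
    const std::size_t n = 100000000;
    std::vector<char> flags(n);

    std::mt19937 rng(42);
    for (std::size_t i = 0; i < n; ++i) flags[i] = rng() & 1;  // random pattern
    std::printf("random flags:      %lld ms\n", run(flags));

    for (std::size_t i = 0; i < n; ++i) flags[i] = i & 1;      // alternating pattern
    std::printf("alternating flags: %lld ms\n", run(flags));
}

If mispredictions were free, the two timings would be about equal; in practice the random case should be noticeably slower.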

answered on Stack Overflow Sep 27, 2015 by OmnipotentEntity

Back in the day, CPUs explicitly supported something a bit like this: after a branch instruction, the next instruction would always be executed whether or not the branch was actually taken (look up "branch delay slot").

I'm pretty sure modern CPUs just dump the whole pipeline on a branch mispredict. And there's no point attempting the optimisation you suggest at execution time when the compiler can easily do it at compilation time.

answered on Stack Overflow Sep 27, 2015 by Alan Stokes

User contributions licensed under CC BY-SA 3.0