Unexpected performance results when trying to toggle a variable

Question

Unexpected performance results when trying to toggle a variable

I tried three approaches to toggle a variable between two values in C, code is at:

void with_xor(int *value)
{
    *value ^= VAL_ONE ^ VAL_TWO;
}

void with_conditional(int *value)
{
    *value = (*value == VAL_ONE)? VAL_TWO : VAL_ONE;
}

void with_if(int *value)
{
    if(VAL_ONE == *value)
    {      
        *value = VAL_TWO;
    }
    else
    {
        *value = VAL_ONE;
    }
}

I was expecting the "xor" approach to be much faster than the other two approaches, but seems like that is not the case. Why is that?

Here are the test results:

xor           0.052300          
conditional   0.035738          
if            0.034924

Here is a disassembly (compiled with no flags):

_with_xor:
0000000100000bd0    pushq   %rbp
0000000100000bd1    movq    %rsp,%rbp
0000000100000bd4    movq    %rdi,0xf8(%rbp)
0000000100000bd8    movq    0xf8(%rbp),%rax
0000000100000bdc    movl    (%rax),%eax
0000000100000bde    xorl    $0x0f,%eax
0000000100000be1    movq    0xf8(%rbp),%rcx
0000000100000be5    movl    %eax,(%rcx)
0000000100000be7    popq    %rbp
0000000100000be8    ret
0000000100000be9    nopl    0x00000000(%rax)
_with_conditional:
0000000100000bf0    pushq   %rbp
0000000100000bf1    movq    %rsp,%rbp
0000000100000bf4    movq    %rdi,0xf8(%rbp)
0000000100000bf8    movq    0xf8(%rbp),%rax
0000000100000bfc    movl    (%rax),%eax
0000000100000bfe    cmpl    $0x0a,%eax
0000000100000c01    jne 0x100000c0c
0000000100000c03    movl    $0x00000005,0xf4(%rbp)
0000000100000c0a    jmp 0x100000c13
0000000100000c0c    movl    $0x0000000a,0xf4(%rbp)
0000000100000c13    movq    0xf8(%rbp),%rax
0000000100000c17    movl    0xf4(%rbp),%ecx
0000000100000c1a    movl    %ecx,(%rax)
0000000100000c1c    popq    %rbp
0000000100000c1d    ret
0000000100000c1e    nop
_with_if:
0000000100000c20    pushq   %rbp
0000000100000c21    movq    %rsp,%rbp
0000000100000c24    movq    %rdi,0xf8(%rbp)
0000000100000c28    movq    0xf8(%rbp),%rax
0000000100000c2c    movl    (%rax),%eax
0000000100000c2e    cmpl    $0x0a,%eax
0000000100000c31    jne 0x100000c3f
0000000100000c33    movq    0xf8(%rbp),%rax
0000000100000c37    movl    $0x00000005,(%rax)
0000000100000c3d    jmp 0x100000c49
0000000100000c3f    movq    0xf8(%rbp),%rax
0000000100000c43    movl    $0x0000000a,(%rax)
0000000100000c49    popq    %rbp
0000000100000c4a    ret
0000000100000c4b    nopl    0x00(%rax,%rax)

c

optimization

asked on Stack Overflow Jul 2, 2012 by

codertop • edited Jul 2, 2012 by

Oliver Charlesworth

2 Answers

First of all: The "conditional" and "if" variants of your code are functionally identical, and indeed the compiler is generating nearly identical code for both.

Branch prediction is likely to be affecting your results significantly here -- because you're always toggling the value between 5 and 10 and straight back, the branch in the conditional/if variant can be predicted with nearly 100% accuracy, turning the function into a store that can run almost-concurrently with the comparison. The XOR variant, on the other hand, must run as a load-xor-store sequence -- none of those three operations can run before the previous one finishes.

answered on Stack Overflow Jul 2, 2012 by

duskwuff -inactive-

You generally can't pipeline xors on the CPU, but you can pipeline move instructions. So the XOR process is much slower.

Essentially, in your instruction pipeline, you want to execute as many instructions as possible. Since in general, branch prediction is pretty good, the CPU will be able to correctly resolve all those mov instructions and decide where everything ends up in parallel. On the other hand, it's impossible to XOR things before their predecessors have been XOR'd so the CPU has to wait for the previous XOR instruction to resolve before executing the next one. Much slower.

answered on Stack Overflow Jul 2, 2012 by

argentage

User contributions licensed under CC BY-SA 3.0