Why is this jump instruction so expensive when performing pointer chasing?

I have a program that performs pointer chasing, and I'm trying to optimize the pointer-chasing loop as much as possible. I noticed that perf record attributes ~20% of the execution time in myFunction() to the jump instruction used to exit the loop after a specific value has been read.

Some things to take note:

  • the pointer-chasing path fits comfortably in the L1 data cache
  • using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect (see the sketch after this list)
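
For reference, here is a minimal sketch of the loop shape (the names, types, and the buffer-of-32-bit-byte-offsets layout are illustrative assumptions inferred from the disassembly below, not the actual code). With optimization it should compile to roughly the same mov / add / cmp / jne loop, modulo register choice:

#include <stdint.h>
#include <string.h>

/* Sketch: chase a chain of 32-bit byte offsets stored inside 'buf' until the
 * sentinel value 0xffffffff is loaded; return how many steps were taken. */
uint64_t chase(const char *buf, uint32_t start)
{
    uint32_t off = start;   /* the loaded offset: %ebx in the disassembly */
    uint64_t steps = 0;     /* loop counter: %rax                         */
    do {
        uint32_t next;
        memcpy(&next, buf + off, sizeof next);  /* mov (%rdx,%rbx,1),%ebx */
        off = next;
        steps++;                                /* add $0x1,%rax          */
    } while (__builtin_expect(off != 0xffffffffu, 1));  /* cmp + jne      */
    return steps;
}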

perf record has the following output:

Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction  /tmp/foobar [Percent: local hits]
Percent│      endbr64
       ...
 80.09 │20:   mov     (%rdx,%rbx,1),%ebx
  0.07 │      add     $0x1,%rax
       │      cmp     $0xffffffff,%ebx
 19.84 │    ↑ jne     20
       ...

I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf. I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.

I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.

pointers
assembly
x86
cpu-architecture
perf
asked on Stack Overflow Jan 13, 2021 by bryantcurto • edited Jan 13, 2021 by Peter Cordes

1 Answer

Remember that the cycles event has to pick one instruction to blame, even when both the mov load and the macro-fused cmp-and-branch uops are waiting for the same result. It's not a matter of one or the other "costing cycles" while it runs; they're both waiting in parallel. (See Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/.)
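
To make the "waiting in parallel" point concrete, here's a rough fused-domain breakdown of the loop body from the perf output, assuming an Intel Sandybridge-family core where cmp/jne macro-fuse:

20:  mov  (%rdx,%rbx,1),%ebx   # 1 uop: load; the next iteration's address depends on this result
     add  $0x1,%rax            # 1 uop: loop counter, independent of the load
     cmp  $0xffffffff,%ebx     # \ macro-fused with the jne into 1 uop;
     jne  20                   # / it also has to wait for the just-loaded %ebx

That's 3 fused-domain uops per iteration, and on any cycle where the load result hasn't arrived yet, both the next load and the fused cmp/jne are stalled waiting for the same thing.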

But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical sampling. That means a CPU with hundreds of uops in flight has to invent a single, necessarily inaccurate, point of blame. Often it's the instruction waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS, blocking allocation of new uops by the front-end.
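
As an aside, perf's precise-sampling modifiers (PEBS on Intel) usually pin each sample closer to the responsible instruction, though they can't remove that fundamental ambiguity. For example (assuming /tmp/foobar is the profiled binary):

perf record -e cycles:pp /tmp/foobar
perf report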

The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Perhaps it's related to how the CPU retires groups of 4(?) uops: this loop has only 3, so which uop happens to be oldest when the perf-event exception is taken can vary.

The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5, the cycle latency of a load with a non-simple (indexed) addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Maybe if the data arrives from cache in the same cycle that the perf counter overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
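
For what it's worth, the reported percentages fit that reading:

    80.09 / 19.84 ≈ 4.0    (the observed split)
    4 + 1 = 5 cycles       (≈ the load-use latency of an indexed load on SnB-family)

i.e. if each iteration is limited by a ~5-cycle load-use latency and the branch soaks up the blame for roughly one of those five cycles, you'd expect about an 80% / 20% split.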

IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.

answered on Stack Overflow Jan 13, 2021 by Peter Cordes

User contributions licensed under CC BY-SA 3.0