I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that `perf record` detects that ~20% of execution time in the function `myFunction()` is spent executing the jump instruction (used to exit the loop after a specific value has been read).
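For context, here is a minimal sketch of what such a loop might look like in C. The function signature, buffer name, starting offset, and sentinel value are guesses reconstructed from the disassembly shown below, not the actual source:

```c
#include <stdint.h>

// Hypothetical reconstruction: each 32-bit element stores the byte offset
// of the next element to visit; 0xffffffff marks the end of the chain.
uint64_t myFunction(const uint8_t *buf)        // buf corresponds to %rdx
{
    uint64_t count = 0;                        // %rax
    uint32_t idx = 0;                          // %ebx (start offset assumed)
    do {
        idx = *(const uint32_t *)(buf + idx);  // mov (%rdx,%rbx,1),%ebx
        count++;                               // add $0x1,%rax
    } while (idx != 0xffffffffu);              // cmp $0xffffffff,%ebx / jne
    return count;
}
```

The loop-carried dependency is the load itself: each iteration's load address depends on the value loaded in the previous iteration, which is what makes this pointer chasing.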
Some things to note:

- Using `__builtin_expect` to avoid the cost of branch misprediction had no noticeable effect; a sketch of how it was applied follows.
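Roughly like this (a sketch; the exact expression and placement in the real code are assumptions):

```c
do {
    idx = *(const uint32_t *)(buf + idx);
    count++;
    // Hint that the exit condition is almost always false,
    // i.e. the backwards branch is almost always taken.
} while (__builtin_expect(idx != 0xffffffffu, 1));
```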
`perf record` has the following output:

```
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction  /tmp/foobar [Percent: local hits]
Percent│       endbr64
       │     ...
 80.09 │20:   mov    (%rdx,%rbx,1),%ebx
  0.07 │      add    $0x1,%rax
       │      cmp    $0xffffffff,%ebx
 19.84 │    ↑ jne    20
       │     ...
```
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf. I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles is spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.
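One possible starting point (a tooling suggestion, not something established above) is LLVM's `llvm-mca`, which statically models how a chosen CPU would decode, dispatch, and execute a snippet of assembly:

```sh
# Analyze the loop body (saved in loop.s) against a Skylake pipeline model;
# -timeline prints per-instruction dispatch/execute/retire cycles.
llvm-mca -mcpu=skylake -timeline loop.s
```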
Remember that the `cycles` event has to pick an instruction to blame, even if both the `mov`-load and the macro-fused `cmp`-and-branch uops are waiting for the result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (See Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/.)
But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the instruction waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS, blocking allocation of new uops by the front-end.
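As an aside, and only a suggestion: PEBS-based "precise" sampling attributes each sample to an exact retired instruction instead of the skid-prone default, so it may blame a different instruction and is worth comparing:

```sh
# The :pp modifier requests precise (PEBS) sampling on Intel CPUs.
perf record -e cycles:pp /tmp/foobar
```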
The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.
The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5-cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the `mov` doesn't get the blame because it can actually execute and get out of the way?
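To make that guess concrete (back-of-the-envelope arithmetic, assuming the 5-cycle load-use latency is the only loop-carried dependency and samples land uniformly across those cycles):

```
4 of 5 cycles: load data not yet arrived -> 4/5 = 80%  (observed: 80.09% on the mov)
1 of 5 cycles: data arrives, load can go -> 1/5 = 20%  (observed: 19.84% on the jne)
```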
IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the `cmp`/`jne` at the bottom of the loop, which in program order appears before the load at the top of the next iteration.