I have two sets of machines, set A and set B, with apparently the same hardware/software configuration, but with a significant difference in performance. Machines in set B are up to x4 faster than machines in set A. However, if I reboot a machine in set A, inexplicably it starts performing as expected, like machines in set B. I can't find an explanation for this behavior.
Hardware

- CPU: Intel Xeon E5-2630v3, 8 physical cores, 2.4 GHz base frequency, 3.2 GHz Turbo frequency
- RAM: 8x 8 GB DDR4, 2133 MHz, one module per channel
- All hardware components are identical in model/part number.

Settings and software

- BIOS version and BIOS settings are identical, e.g. HT is enabled and Turbo Boost is enabled. See the link for details.
- The machines are running the same 64-bit version of Red Hat 6 with the same kernel.
Machines in both sets are idle, but whatever load I try to run, I get very different results in terms of performance. For the sake of simplicity, I will run all the benchmarks on core 0. The result is reproducible on all cores (both processors).
[root@SET_A ~]# uptime
 11:48:40 up 51 days, 19:34,  2 users,  load average: 0.00, 0.00, 0.00
[root@SET_A ~]# taskset -c 0 sh -c 'time echo "scale=5000; a(1)*4" | bc -l > /dev/null'
real    0m43.751s
user    0m43.742s
sys     0m0.005s
[root@SET_B ~]# uptime
 11:50:00 up 15 days, 19:43,  1 user,  load average: 0.00, 0.00, 0.00
[root@SET_B ~]# taskset -c 0 sh -c 'time echo "scale=5000; a(1)*4" | bc -l > /dev/null'
real    0m18.648s
user    0m18.646s
sys     0m0.004s
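For reference, the ratio of the two wall-clock times works out as follows (a trivial check, using awk for the floating-point division):

```shell
# Set A vs Set B wall-clock ratio for the bc benchmark above
awk 'BEGIN { printf "ratio: %.2f\n", 43.751 / 18.648 }'
# -> ratio: 2.35
```

So the bc run is "only" ~2.3x slower; the pure register-increment benchmark further down shows the full x4 gap, which suggests the slowdown is not specific to bc.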
turbostat reports core 0 as being in the C0 consumption state and in the P0 performance state, with Turbo frequency enabled, for the whole duration of the benchmark.
[root@SET_A ~]# turbostat -i 5
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6    %c7 CTMP PTMP   %pc2   %pc3   %pc6   %pc7  Pkg_W  RAM_W PKG_% RAM_%
              3.15 3.18 2.39   0   3.26   0.00  93.59   0.00   40   46  49.77   0.00   0.00   0.00  46.45  22.41  0.00  0.00
 0   0   0   99.99 3.19 2.39   0   0.01   0.00   0.00   0.00   40   46   0.00   0.00   0.00   0.00  29.29  12.75  0.00  0.00
[root@SET_B ~]# turbostat -i 5
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6    %c7 CTMP PTMP   %pc2   %pc3   %pc6   %pc7  Pkg_W  RAM_W PKG_% RAM_%
              3.14 3.18 2.39   0   3.27   0.00  93.59   0.00   38   40  49.81   0.00   0.00   0.00  46.12  21.49  0.00  0.00
 0   0   0   99.99 3.19 2.39   0   0.01   0.00   0.00   0.00   38   40   0.00   0.00   0.00   0.00  32.27  13.51  0.00  0.00
To simplify the benchmark as much as possible (no FP, as few memory accesses as possible), I wrote the following 32-bit code.
.text
.global _start
_start:
    movl $0x0, %ecx
oloop:
    cmp $0x2, %ecx
    je end
    inc %ecx
    movl $0xFFFFFFFF, %eax
    movl $0x0, %ebx
loop:
    cmp %eax, %ebx
    je oloop
    inc %ebx
    jmp loop
end:
    mov $1, %eax
    int $0x80
.data
value: .long 0
It simply increments a register from 0 to
0xFFFFFFFF twice, nothing else.
[root@SET_A ~]# md5sum simple.out
30fb3a645a8a0088ff303cf34cadea37  simple.out
[root@SET_A ~]# time taskset -c 0 ./simple.out
real    0m10.801s
user    0m10.804s
sys     0m0.001s
[root@SET_B ~]# md5sum simple.out
30fb3a645a8a0088ff303cf34cadea37  simple.out
[root@SET_B ~]# time taskset -c 0 ./simple.out
real    0m2.722s
user    0m2.724s
sys     0m0.000s
x4 difference to increment a register.
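The timings are consistent with the loop's instruction cost. Assuming the inner loop is 4 instructions per iteration (cmp/je/inc/jmp) and a 3.2 GHz turbo clock, a back-of-the-envelope calculation reproduces both runtimes:

```shell
# expected instruction count and runtime at 4 vs 1 instructions per cycle
awk 'BEGIN {
    insns = 2 * 2^32 * 4                  # two outer passes, 4 insns per inner iteration
    printf "instructions: %.0f\n", insns  # ~34.4e9
    printf "4 IPC: %.1f s   1 IPC: %.1f s\n", insns / 4 / 3.2e9, insns / 3.2e9
}'
# -> instructions: 34359738368
# -> 4 IPC: 2.7 s   1 IPC: 10.7 s
```

These match the measured 2.7 s and 10.8 s almost exactly, so Set B sustains the expected 4 instructions per cycle while Set A is stuck at 1.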
More observations with the simplified benchmark
During the benchmark, the number of interrupts is the same on both machines, ~1,100 intr/s (as reported by mpstat). These are mostly Local Timer Interrupts on CPU0, so there's basically no difference in the sources of interrupts.
[root@SET_A ~]# mpstat -P ALL -I SUM 1
01:00:35 PM  CPU    intr/s
01:00:36 PM  all   1117.00
[root@SET_B ~]# mpstat -P ALL -I SUM 1
01:04:50 PM  CPU    intr/s
01:04:51 PM  all   1112.00
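The mpstat figure can also be cross-checked straight from /proc/stat (a rough sketch, assuming a Linux procfs; the first field of the "intr" line is the total interrupt count since boot):

```shell
# sample the system-wide interrupt counter twice, one second apart
i1=$(awk '/^intr/ { print $2 }' /proc/stat)
sleep 1
i2=$(awk '/^intr/ { print $2 }' /proc/stat)
echo "intr/s: $(( i2 - i1 ))"
```

Unlike mpstat, this counts all CPUs together, but it is enough to confirm that the two sets see the same interrupt load.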
The C-state picture is the same as above.
Set A:

 Performance counter stats for 'taskset -c 0 ./simple.out':

         41,383,515 instructions:k            #    0.00  insns per cycle        [71.42%]
     34,360,528,207 instructions:u            #    1.00  insns per cycle        [71.42%]
             63,675 cache-references                                            [71.42%]
              6,365 cache-misses              #    9.996 % of all cache refs    [71.43%]
     34,439,207,904 cycles                    #    0.000 GHz                    [71.44%]
     34,400,748,829 instructions              #    1.00  insns per cycle        [71.44%]
     17,186,890,732 branches                                                    [71.44%]
                143 page-faults
                  0 migrations
              1,117 context-switches

       10.905973410 seconds time elapsed
Set B:

 Performance counter stats for 'taskset -c 0 ./simple.out':

         11,112,919 instructions:k            #    0.00  insns per cycle        [71.41%]
     34,351,189,050 instructions:u            #    3.99  insns per cycle        [71.44%]
             32,765 cache-references                                            [71.46%]
              3,266 cache-misses              #    9.968 % of all cache refs    [71.47%]
      8,600,461,054 cycles                    #    0.000 GHz                    [71.46%]
     34,378,806,261 instructions              #    4.00  insns per cycle        [71.41%]
     17,192,017,836 branches                                                    [71.37%]
                143 faults
                  2 migrations
                281 context-switches

        2.740606064 seconds time elapsed
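perf prints "0.000 GHz" here, but the effective clock can be recovered from the cycle counts and the elapsed times, and it confirms turbostat's picture: both sets run at turbo, so the x4 gap is purely IPC, not frequency:

```shell
# effective clock = cycles / elapsed seconds
awk 'BEGIN {
    printf "Set A: %.2f GHz\n", 34439207904 / 10.905973410 / 1e9
    printf "Set B: %.2f GHz\n",  8600461054 /  2.740606064 / 1e9
}'
# -> Set A: 3.16 GHz
# -> Set B: 3.14 GHz
```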
The benchmark's only system call is sys_exit. Clearly the number of context switches is higher for Set A (1,117 vs 281), and so is the number of kernel-space instructions (~40M vs ~10M). Might this be caused by the same underlying reason as above: instructions which lead to a reschedule being accounted to user space, or instructions executed in interrupt context?
The total number of instructions executed is only ~0.06% higher for Set A, but the number of L3 cache references is double. Is this expected? A quick check of the cache configuration gives the same result on both sets: caches are correctly enabled (CR0: 0x80050033, the CD bit is 0) and the MTRR configuration is identical.
Is there an obvious reason that can explain this difference in performance?
Why are machines in Set A running at 1 instruction per cycle while machines in Set B run at 4 instructions per cycle, given that the hardware/software configuration is identical?
Why does rebooting a machine seem to fix this behavior? As explained in the introduction, if I reboot a machine in Set A, it starts performing as expected.
The cause here is either so trivial that I missed it, or so complex that it can't really be explained. I hope it's the former. Any help, hint, or suggestion is appreciated.
Of course it would be great if someone could give an answer that fixes your problem right away. But I fear there is no obvious answer. There might, however, be some directions of attack not attempted yet:
1. Under the hypothesis that some of your machines sometimes show the behavior and others never do, one could suspect subtle hardware issues (sporadic, undetectable by system diagnostics). The counter-hypothesis is that this can happen to any of your machines.
2. If the effect also shows up on "good" machines, there must be subtle differences in the installation (HD content), or the hard disk itself is the problem. If the effect never shows on "good" machines, the "subtle hardware issue" theory is somewhat confirmed.
3. Put the good machines in the physical locations of the bad ones and see what happens.
Step 1 should help in finding such subtle differences and in identifying which of your machines work and which do not. Step 3 is about finding the explanation.
User contributions licensed under CC BY-SA 3.0