Introduction
I have two sets of machines, set A and set B, with apparently the same hardware/software configuration, but with a significant difference in performance. Machines in set B are up to 4x faster than machines in set A. However, if I reboot a machine in set A, it inexplicably starts performing as expected, like the machines in set B. I can't find an explanation for this behavior.
Hardware configuration
- S2600KP platform
- E5-2630v3, 8 physical cores, 2.4GHz base frequency, 3.2GHz Turbo frequency
- 8x8GB RAM DDR4, 2133MHz, one module per channel

All the hardware components are identical in Model/Part Number.
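For reference, the hardware identity can be cross-checked from the running system on one machine of each set; a minimal sketch (assuming dmidecode is installed and the commands are run as root):

grep 'model name' /proc/cpuinfo | sort -u           # CPU model string
dmidecode -s baseboard-product-name                  # board model
dmidecode -t memory | grep -E 'Part Number|Speed'    # DIMM part numbers and configured speed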
Settings and software
BIOS version and BIOS settings are identical, e.g. HT is enabled, Turbo Boost is enabled. See the link for details.
The machines are running the same 64-bit version of Red Hat 6 with kernel 2.6.32-504.
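Likewise, the firmware and OS versions can be compared quickly between the two sets; a minimal sketch, assuming root access:

dmidecode -s bios-version          # BIOS revision
dmidecode -s bios-release-date
uname -r                           # running kernel (should be 2.6.32-504 everywhere)
cat /etc/redhat-release            # distribution release
grep -c '^processor' /proc/cpuinfo # logical CPUs; 32 expected with 2x 8 cores and HT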
The Problem
Machines in both sets are idle, but whatever load I try to run, I get very different results in terms of performance. For the sake of simplicity, I will run all the benchmarks on core 0. The result is reproducible on all cores (both processors).
Set A
[root@SET_A ~]# uptime
11:48:40 up 51 days, 19:34, 2 users, load average: 0.00, 0.00, 0.00
[root@SET_A ~]# taskset -c 0 sh -c 'time echo "scale=5000; a(1)*4" | bc -l > /dev/null'
real 0m43.751s
user 0m43.742s
sys 0m0.005s
Set B
[root@SET_B ~]# uptime
11:50:00 up 15 days, 19:43, 1 user, load average: 0.00, 0.00, 0.00
[root@SET_B ~]# taskset -c 0 sh -c 'time echo "scale=5000; a(1)*4" | bc -l > /dev/null'
real 0m18.648s
user 0m18.646s
sys 0m0.004s
Observations
turbostat reports core 0 as being in the C0 consumption state and in the P0 performance state, with Turbo frequency enabled, for the whole duration of the benchmark.
Set A
[root@SET_A ~]# turbostat -i 5
pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 %c7 CTMP PTMP %pc2 %pc3 %pc6 %pc7 Pkg_W RAM_W PKG_% RAM_%
3.15 3.18 2.39 0 3.26 0.00 93.59 0.00 40 46 49.77 0.00 0.00 0.00 46.45 22.41 0.00 0.00
0 0 0 99.99 3.19 2.39 0 0.01 0.00 0.00 0.00 40 46 0.00 0.00 0.00 0.00 29.29 12.75 0.00 0.00
Set B
[root@SET_B ~]# turbostat -i 5
pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 %c7 CTMP PTMP %pc2 %pc3 %pc6 %pc7 Pkg_W RAM_W PKG_% RAM_%
3.14 3.18 2.39 0 3.27 0.00 93.59 0.00 38 40 49.81 0.00 0.00 0.00 46.12 21.49 0.00 0.00
0 0 0 99.99 3.19 2.39 0 0.01 0.00 0.00 0.00 38 40 0.00 0.00 0.00 0.00 32.27 13.51 0.00 0.00
Simplified benchmark
To simplify the benchmark as much as possible (no FP, as few memory accesses as possible), I wrote the following 32-bit code.
.text
.global _start
_start:
        movl    $0x0, %ecx          # outer loop counter
oloop:
        cmp     $0x2, %ecx          # run the inner loop twice
        je      end
        inc     %ecx
        movl    $0xFFFFFFFF, %eax   # inner loop bound
        movl    $0x0, %ebx          # inner loop counter
loop:
        cmp     %eax, %ebx          # count %ebx from 0 up to 0xFFFFFFFF
        je      oloop
        inc     %ebx
        jmp     loop
end:
        mov     $1, %eax            # sys_exit
        int     $0x80
.data
value:
        .long   0                   # unused
It simply increments a register from 0 to 0xFFFFFFFF twice, nothing else.
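For reference, a minimal way to build it on the 64-bit host with GNU binutils (the exact build commands are not stated above, so the flags and the simple.s filename are assumptions):

as --32 simple.s -o simple.o            # assemble as 32-bit code
ld -m elf_i386 simple.o -o simple.out   # link a 32-bit static ELF binary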
Set A
[root@SET_A ~]# md5sum simple.out
30fb3a645a8a0088ff303cf34cadea37 simple.out
[root@SET_A ~]# time taskset -c 0 ./simple.out
real 0m10.801s
user 0m10.804s
sys 0m0.001s
Set B
[root@SET_B ~]# md5sum simple.out
30fb3a645a8a0088ff303cf34cadea37 simple.out
[root@SET_B ~]# time taskset -c 0 ./simple.out
real 0m2.722s
user 0m2.724s
sys 0m0.000s
A 4x difference just to increment a register.
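As a sanity check on these numbers: the inner loop body is 4 instructions (cmp, je, inc, jmp), so two passes over roughly 2^32 iterations retire about 2 x 2^32 x 4 ≈ 34.4G instructions, which matches the perf counts further down. At the ~3.2GHz Turbo frequency reported by turbostat, the 2.7s run corresponds to about 1 cycle per iteration (4 instructions per cycle), while the 10.8s run corresponds to about 4 cycles per iteration (1 instruction per cycle).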
More observations with the simplified benchmark
During the benchmark, the number of interrupts is the same on both machines, ~1100 intr/s (reported with mpstat). These are mostly Local Timer Interrupts on CPU0, so there's basically no difference in the source of interruption.
Set A
[root@SET_A ~]# mpstat -P ALL -I SUM 1
01:00:35 PM CPU intr/s
01:00:36 PM all 1117.00
Set B
[root@SET_B ~]# mpstat -P ALL -I SUM 1
01:04:50 PM CPU intr/s
01:04:51 PM all 1112.00
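The local-timer claim can be cross-checked by sampling /proc/interrupts during the run; a rough sketch (the LOC row counts local APIC timer interrupts, one column per CPU):

grep 'LOC:' /proc/interrupts    # first snapshot
sleep 1
grep 'LOC:' /proc/interrupts    # the CPU0 column should grow by roughly the rate mpstat reports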
For P-States and C-States, the same observations as above hold.
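The cpufreq and cpuidle configuration can also be dumped directly from sysfs for comparison between the two sets; a minimal sketch, assuming the stock RHEL 6 drivers are loaded:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # e.g. performance or ondemand
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # current frequency in kHz
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name        # C-states exposed by the idle driver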
perf analysis
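The exact perf invocation is not shown; based on the counters that appear below, it was presumably something along the lines of:

perf stat -e instructions:k,instructions:u,cache-references,cache-misses \
          -e cycles,instructions,branches,page-faults,migrations,context-switches \
          taskset -c 0 ./simple.out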
Set A
Performance counter stats for 'taskset -c 0 ./simple.out':
41,383,515 instructions:k # 0.00 insns per cycle [71.42%]
34,360,528,207 instructions:u # 1.00 insns per cycle [71.42%]
63,675 cache-references [71.42%]
6,365 cache-misses # 9.996 % of all cache refs [71.43%]
34,439,207,904 cycles # 0.000 GHz [71.44%]
34,400,748,829 instructions # 1.00 insns per cycle [71.44%]
17,186,890,732 branches [71.44%]
143 page-faults
0 migrations
1,117 context-switches
10.905973410 seconds time elapsed
Set B
Performance counter stats for 'taskset -c 0 ./simple.out':
11,112,919 instructions:k # 0.00 insns per cycle [71.41%]
34,351,189,050 instructions:u # 3.99 insns per cycle [71.44%]
32,765 cache-references [71.46%]
3,266 cache-misses # 9.968 % of all cache refs [71.47%]
8,600,461,054 cycles # 0.000 GHz [71.46%]
34,378,806,261 instructions # 4.00 insns per cycle [71.41%]
17,192,017,836 branches [71.37%]
143 faults
2 migrations
281 context-switches
2.740606064 seconds time elapsed
Observations
- The benchmark only ever enters the kernel explicitly through the final sys_exit. Clearly the number of context switches is higher for Set A.
- The number of kernel instructions is much higher for Set A (~40M vs ~10M). This might be caused by the same reason as above? Instructions which lead to a reschedule and are accounted as user space, or instructions in interrupt context?
- The number of instructions is only 0.06% higher, but the number of L3 cache references is double. Is this expected? A quick check on cache configuration leads to the same result. Caches are correctly enabled (CR0: 0x80050033, CD is 0) and the MTRR configuration is identical.
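A note on how that check can be reproduced: CR0 is not readable from userspace (it needs a debugger or a kernel module), but the MTRR layout is exported by the kernel, so a quick comparison between one machine of each set could look like this ("SET_B" is a placeholder hostname, assuming passwordless ssh as root):

ssh SET_B cat /proc/mtrr > /tmp/mtrr.setb   # MTRR ranges/types from a Set B machine
diff /proc/mtrr /tmp/mtrr.setb              # compare against this (Set A) machine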
Final questions
Is there an obvious reason that can explain this difference in performance?
Why are machines in Set A running at 1 instruction per cycle while machines in Set B run at 4 instructions per cycle, given that the hardware/software configuration is identical?
Why does rebooting a machine seem to fix this behavior? As explained in the introduction, if I reboot a machine in Set A, it starts performing as expected.
The cause here is either so trivial that I missed it, or so complex that it can't really be explained. I hope it's the former; any help/hint/suggestion is appreciated.
Of course it would be great if someone could give an answer that lets you fix your problem right away. But I fear there is no obvious answer. There might, however, be some directions for attacking the problem that have not been attempted yet:
1. Under the hypothesis that some of your machines sometimes show the behavior and others never do, one could conclude subtle hardware issues (sporadic, undetectable by means of system diagnostics). The counter-hypothesis would be that this can happen to any of your machines.
2. If the effect also shows on "good" machines, there must be subtle differences in the installation (HD content), or the hard disk is the problem. If the effect never shows on "good" machines, the "subtle hardware issue" theory is somewhat confirmed.
3. Put the good machines in the physical locations of the bad ones and see what happens.
Step 1 should help find such subtle differences by identifying which of your machines work and which do not. Step 3 is about finding the explanation.
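As a concrete starting point for step 1, a small sketch that runs the original bc benchmark on a list of machines and logs the timings (hostnames are placeholders; assumes passwordless ssh as root and that bc is installed everywhere):

#!/bin/sh
# run the bc benchmark pinned to core 0 on each host and record the wall-clock time
for host in seta-01 seta-02 setb-01 setb-02; do
    elapsed=$(ssh "$host" "taskset -c 0 sh -c 'time echo \"scale=5000; a(1)*4\" | bc -l > /dev/null'" 2>&1 | awk '/^real/ {print $2}')
    echo "$(date '+%F %T') $host $elapsed"
done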