Same hardware/software, significant difference in performance


Introduction

I have two sets of machines, set A and set B, with apparently the same hardware/software configuration but with a significant difference in performance: machines in set B are up to 4x faster than machines in set A. However, if I reboot a machine in set A, it inexplicably starts performing as expected, like the machines in set B. I can't find an explanation for this behavior.

Hardware configuration

  • Intel S2600KP platform
  • Dual Intel Xeon E5-2630 v3 processors, 8 physical cores each, 2.4GHz base frequency, 3.2GHz Turbo frequency.
  • 8x8GB DDR4 RAM, 2133MHz, one module per channel
  • None of the machines has SEL events logged by the BMC that could point to a hardware issue (see the checks sketched below).
  • None of the machines triggers any Machine Check Exception during the benchmarks.

All the hardware components are identical in Model/Part Number.
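For reference, both checks can be reproduced with standard tooling; a minimal sketch, assuming in-band IPMI access via ipmitool and the mcelog service with its default RHEL 6 log location:

# BMC System Event Log: should show no hardware-related events
ipmitool sel list
# mcelog output: should stay empty during the benchmarks
cat /var/log/mcelog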

Settings and software

  • BIOS version and BIOS settings are identical, e.g. HT is enabled, Turbo Boost is enabled. See the link for details.

  • The machines are running the same 64-bit version of Red Hat 6 with kernel 2.6.32-504.
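A quick way to spot-check these invariants on each machine (a minimal sketch; dmidecode must be installed):

# BIOS version, kernel and distribution should match across all machines
dmidecode -s bios-version
uname -r
cat /etc/redhat-release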

The Problem

Machines in both sets are idle, but whatever load I run, I get very different performance results. For the sake of simplicity, I will run all the benchmarks on core 0; the result is reproducible on all cores, on both processors.

Set A

[root@SET_A ~]# uptime
 11:48:40 up 51 days, 19:34,  2 users,  load average: 0.00, 0.00, 0.00

[root@SET_A ~]#  taskset -c 0 sh -c 'time echo "scale=5000; a(1)*4" | bc -l > /dev/null'

real    0m43.751s
user    0m43.742s
sys     0m0.005s

Set B

[root@SET_B ~]# uptime
11:50:00 up 15 days, 19:43,  1 user,  load average: 0.00, 0.00, 0.00

[root@SET_B ~]# taskset -c 0 sh -c 'time echo "scale=5000; a(1)*4" | bc -l > /dev/null'

real    0m18.648s
user    0m18.646s
sys     0m0.004s

Observations

turbostat reports core 0 as being in the C0 (active) state and in the P0 performance state, with Turbo frequency sustained for the whole duration of the benchmark.

Set A

[root@SET_A ~]# turbostat  -i 5
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6    %c7 CTMP PTMP   %pc2   %pc3   %pc6   %pc7  Pkg_W RAM_W PKG_% RAM_%
             3.15 3.18 2.39   0   3.26   0.00  93.59   0.00   40   46  49.77   0.00   0.00   0.00  46.45 22.41  0.00  0.00
 0   0   0  99.99 3.19 2.39   0   0.01   0.00   0.00   0.00   40   46   0.00   0.00   0.00   0.00  29.29 12.75  0.00  0.00

Set B

[root@SET_B ~]# turbostat  -i 5
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6    %c7 CTMP PTMP   %pc2   %pc3   %pc6   %pc7  Pkg_W RAM_W PKG_% RAM_%
             3.14 3.18 2.39   0   3.27   0.00  93.59   0.00   38   40  49.81   0.00   0.00   0.00  46.12 21.49  0.00  0.00
 0   0   0  99.99 3.19 2.39   0   0.01   0.00   0.00   0.00   38   40   0.00   0.00   0.00   0.00  32.27 13.51  0.00  0.00

Simplified benchmark

To simplify the benchmark as much as possible (no floating point, as few memory accesses as possible), I wrote the following 32-bit assembly code.

.text
.global _start

_start:
      movl    $0x0, %ecx          # outer loop counter
oloop:
      cmp     $0x2, %ecx          # run the inner loop twice
      je      end
      inc     %ecx
      movl    $0xFFFFFFFF, %eax   # inner loop bound
      movl    $0x0, %ebx          # inner loop counter
loop: cmp     %eax, %ebx
      je      oloop               # back to the outer loop once the bound is hit
      inc     %ebx
      jmp     loop

end:
      mov     $1, %eax            # sys_exit
      int     $0x80
.data
value:
        .long 0

It simply increments a register from 0 to 0xFFFFFFFF twice, nothing else.
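The post does not show the build commands; assuming the source is saved as simple.s, a plausible way to assemble and link it as a 32-bit static binary on the 64-bit host is:

as --32 -o simple.o simple.s
ld -m elf_i386 -o simple.out simple.o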

Set A

[root@SET_A ~]# md5sum simple.out 
30fb3a645a8a0088ff303cf34cadea37  simple.out
[root@SET_A ~]# time taskset -c 0 ./simple.out

real    0m10.801s
user    0m10.804s
sys     0m0.001s

Set B

[root@SET_B ~]# md5sum simple.out 
30fb3a645a8a0088ff303cf34cadea37  simple.out
[root@SET_B ~]# time taskset -c 0 ./simple.out

real    0m2.722s
user    0m2.724s
sys     0m0.000s

A 4x difference just to increment a register.

More observations with the simplified benchmark

During the benchmark, the number of interrupts is the same on both machines, ~1100 intr/s (reported with mpstat). These are mostly Local Timer Interrupts on CPU0 (see the cross-check sketched after the two outputs), so there's basically no difference in interrupt sources.

Set A

[root@SET_A ~]# mpstat -P ALL -I SUM 1
01:00:35 PM  CPU    intr/s
01:00:36 PM  all   1117.00

Set B

[root@SET_B ~]# mpstat -P ALL -I SUM 1
01:04:50 PM  CPU    intr/s
01:04:51 PM  all   1112.00
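The "mostly Local Timer Interrupts" claim can be cross-checked against /proc/interrupts; a minimal sketch (the LOC row is the local APIC timer):

# Two snapshots one second apart: the CPU0 column of the LOC row
# should grow by roughly the rate mpstat reports (~1100/s)
grep LOC /proc/interrupts; sleep 1; grep LOC /proc/interrupts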

The same holds for P-states and C-states as described above.

perf analysis

Set A

 Performance counter stats for 'taskset -c 0 ./simple.out':

        41,383,515 instructions:k            #    0.00  insns per cycle         [71.42%]
    34,360,528,207 instructions:u            #    1.00  insns per cycle         [71.42%]
            63,675 cache-references                                             [71.42%]
             6,365 cache-misses              #    9.996 % of all cache refs     [71.43%]
    34,439,207,904 cycles                    #    0.000 GHz                     [71.44%]
    34,400,748,829 instructions              #    1.00  insns per cycle         [71.44%]
    17,186,890,732 branches                                                     [71.44%]
               143 page-faults                                                 
                 0 migrations                                                  
             1,117 context-switches                                            

      10.905973410 seconds time elapsed

Set B

 Performance counter stats for 'taskset -c 0 ./simple.out':

        11,112,919 instructions:k            #    0.00  insns per cycle         [71.41%]
    34,351,189,050 instructions:u            #    3.99  insns per cycle         [71.44%]
            32,765 cache-references                                             [71.46%]
             3,266 cache-misses              #    9.968 % of all cache refs     [71.47%]
     8,600,461,054 cycles                    #    0.000 GHz                     [71.46%]
    34,378,806,261 instructions              #    4.00  insns per cycle         [71.41%]
    17,192,017,836 branches                                                     [71.37%]
               143 faults                                                      
                 2 migrations                                                  
               281 context-switches                                            

       2.740606064 seconds time elapsed
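The exact perf invocation is not shown above; a command along these lines reproduces the counter set (the event list is an assumption):

perf stat -e instructions:k,instructions:u,cache-references,cache-misses \
          -e cycles,instructions,branches,page-faults,migrations,context-switches \
          taskset -c 0 ./simple.out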

Observations

  • The number of kernel-space instructions differs because of the control paths that lead to a reschedule. There are no system calls involved apart from the final sys_exit. The number of context switches is clearly higher for Set A.
  • There is also a very small difference in user-space instructions (~10M). Might this have the same cause as above, i.e. instructions leading to a reschedule that are accounted as user space, or instructions executed in interrupt context?
  • The total number of instructions for Set A is only 0.06% higher, but the number of L3 cache references is double. Is this expected? A quick check of the cache configuration gives the same result on both sets: caches are correctly enabled (CR0 is 0x80050033, so CD is 0) and the MTRR configuration is identical (see the sketch after this list).
  • Probably the most interesting value is the instructions per cycle: 1 instruction per cycle on Set A versus 4 instructions per cycle on Set B.
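A minimal sketch of that cache-configuration check, assuming the msr kernel module and msr-tools are available (reading CR0 itself needs ring 0, e.g. a small debug module, so only the MTRR side is shown):

# Variable-range MTRRs: should be identical between Set A and Set B
cat /proc/mtrr
# Default MTRR memory type (IA32_MTRR_DEF_TYPE, MSR 0x2FF) on CPU 0
modprobe msr && rdmsr -p 0 0x2FF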

Final questions

  • Is there an obvious reason that can explain this difference in performance?

  • Why are machines in Set A running at 1 instruction per cycle while machines in Set B run at 4 instructions per cycle, given that the hardware/software configuration is identical?

  • Why does rebooting a machine seem to fix this behavior? As explained in the introduction, if I reboot a machine in Set A, it starts performing as expected.

The cause here is either so trivial that I missed it or so complex that it can't really be explained. I hope it's the former; any help/hint/suggestion is appreciated.

Tags: performance, hardware
asked on Server Fault Apr 4, 2015 by Marco Guerri • edited Jan 28, 2017 by Marco Guerri

1 Answer


Of course it would be great if someone could give an answer that lets you fix your problem right away, but I fear there is no obvious one. There are, however, some directions of attack that may not have been attempted yet:

  1. Under the hypothesis that some of your machines sometimes show the behavior and others never do, one could suspect subtle hardware issues (sporadic, undetectable by system diagnostics). The counter-hypothesis would be that this can happen to any of your machines.

    • Step 1: Make sure all parts of the machines can be tracked in what is to come.
    • Step 2: Pick some of the allegedly "good" machines, run them with the hard drives of some "bad" machines, and see if the effect shows.
    • Step 3: As a control, run "bad" machines with the hard drives of "good" ones and see if the problem shows.

If the effect also shows on "good" machines, there must be subtle differences in the installation (HD content) or the hard disk is the problem.

If the effect never shows on "good" machines, the "subtle hardware issue" theory is kind of confirmed.

  2. Look for those differences which "really and clearly cannot cause that - never!". There is the story about the GM engineer who was sent to investigate why a GM car failed to start whenever its owner bought strawberry ice cream at a shop, but never when he bought vanilla... In your case, look at where the good and bad machines are placed. More or less ventilation in the room? Closer to some windows and direct sunlight? Closer to or further from some electrical installations (climate control, ...)? Different power supply nets of the building? More equipment connected to some power distribution paths/UPS? What I suggest here might not appear reasonable at first glance, but hey - your problem is rather sporadic, right? (50 days of uptime and your machines "turn" at some point during that period.)

Put the good machines in the physical locations of the bad ones and see what happens.

  3. Some of my more painful working days as an embedded software engineer came from people swapping "identical equipment" on my desk... and from me spending half a day to half a week trying to find out why my setup no longer worked as it did before I left for the night... Sometimes even the tiniest revision of some component (or of the software/configuration version flashed into it) can have observable effects. In some of my cases along that line, not even the manufacturer was aware of those "unimportant" changes - sometimes they were only in the PCB layout, or a change of capacitor supplier to a second source...

Step 1 should help in finding such subtle differences and in identifying which of your machines work and which do not. Step 3 is about finding the explanation.
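A hedged sketch of how such an inventory could be captured on every machine and then diffed pairwise (output paths are arbitrary):

# Hardware/firmware inventory for pairwise diffing across machines
dmidecode   > /tmp/$(hostname)-dmi.txt
lspci -vnn  > /tmp/$(hostname)-pci.txt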

  4. Try to assess the "MTBF" and analyse it for a pattern. Since you can clearly see when a machine starts to slow down, you could measure the average time it takes a machine to go slow. It can then be insightful to see whether different machines have different times to failure, whether the times cluster around a value, or whether there is no pattern at all (i.e. it is totally random when it happens)... The sketch below shows one way to capture those timestamps.
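A minimal watchdog sketch, assuming the reduced benchmark from the question is deployed as ./simple.out (interval and log path are arbitrary choices):

#!/bin/bash
# Run the reduced benchmark hourly and log its wall-clock time;
# a jump from ~2.7s to ~10.8s marks the moment the machine "turned".
while true; do
    t=$( { time taskset -c 0 ./simple.out; } 2>&1 | awk '/^real/ {print $2}' )
    echo "$(date) real=$t" >> /var/log/slowdown-watch.log
    sleep 3600
done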
answered on Server Fault Apr 4, 2015 by BitTickler

User contributions licensed under CC BY-SA 3.0