counting L1 cache misses with PAPI_read_counters gives unexpected results

3

I am trying to use PAPI library to count cache misses. cache hit performance counter is not available on my hardware, that's why I am trying to determine cache hits with no cache misses. I am trying few things. First version of my code is this:

  int numEvents = 2;

  long long values[2];

  int events[2] = {PAPI_L1_DCM, PAPI_L2_TCM};


 if (PAPI_start_counters(events, numEvents) != PAPI_OK )  // !=PAPI_OK

    printf("PAPI error: %d\n", 1);

 for(int i=0; i < arr_size; i++)
  {
    array[i].value = 1;

  }

_mm_mfence();

if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
   fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
   exit(1);
}
miss1 = values[0];

_mm_mfence();

for(int i=0; i < arr_size; i++){
         array[i].value = array[i].value + 9; // (int) sum
}

_mm_mfence();

if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
    exit(1);
}

miss2 = values[0];

printf("before flush miss_1 %lli, miss_2 %lli \n", miss1, miss2);

the problem is that this piece of code is supposed to give me cache hits, so L1 cache miss should be extremely low. but I get unexpectedly high results for miss_2. With array size of 200, miss_2 is nearly 100. it doesn't give any valid result to judge that it really was hit, because of high number of cache misses.

I also tried to rewrite it like this:

if (PAPI_start_counters(events, numEvents) != PAPI_OK )  // !=PAPI_OK

     printf("PAPI error: %d\n", 1);

for(int i=0; i < arr_size; i++){
         array[i].value = array[i].value + 9; // (int) sum
}

if ( PAPI_stop_counters(values, numEvents) != PAPI_OK)
   printf("PAPI error: 2\n");

printf("before flush miss %lli\n", values[0]);

but this gives even worse result, miss_2 is more than 200. Is there anything I am not doing right? It was supposed to give more precise result, but it's doing terrible now. Or I am missing something.
I have tried without fences, I am sure that at least they don't do any harm. I would really appreciate any suggestion.

The disadvantage of PAPI_read_counters is it's overhead, and not great performance, but now I don't care abut performance, I want to correctly determine cache hits.

Though I was also thinking to use RDMPC but I have not found an example to use it without _asm function overwriting. Is this really the only way to use rdpmc? there does not exist already defined function which I would not have to overwrite?

EDIT: adding compiler code for PAPI_read

    ./prog6:     file format elf64-x86-64


Disassembly of section .init:

00000000000009c0 <_init>:
 9c0:   48 83 ec 08             sub    $0x8,%rsp
 9c4:   48 8b 05 1d 16 20 00    mov    0x20161d(%rip),%rax        # 201fe8 <__gmon_start__>
 9cb:   48 85 c0                test   %rax,%rax
 9ce:   74 02                   je     9d2 <_init+0x12>
 9d0:   ff d0                   callq  *%rax
 9d2:   48 83 c4 08             add    $0x8,%rsp
 9d6:   c3                      retq   

Disassembly of section .plt:

00000000000009e0 <.plt>:
 9e0:   ff 35 6a 15 20 00       pushq  0x20156a(%rip)        # 201f50 <_GLOBAL_OFFSET_TABLE_+0x8>
 9e6:   ff 25 6c 15 20 00       jmpq   *0x20156c(%rip)        # 201f58 <_GLOBAL_OFFSET_TABLE_+0x10>
 9ec:   0f 1f 40 00             nopl   0x0(%rax)

00000000000009f0 <puts@plt>:
 9f0:   ff 25 6a 15 20 00       jmpq   *0x20156a(%rip)        # 201f60 <puts@GLIBC_2.2.5>
 9f6:   68 00 00 00 00          pushq  $0x0
 9fb:   e9 e0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a00 <clock_gettime@plt>:
 a00:   ff 25 62 15 20 00       jmpq   *0x201562(%rip)        # 201f68 <clock_gettime@GLIBC_2.17>
 a06:   68 01 00 00 00          pushq  $0x1
 a0b:   e9 d0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a10 <getpid@plt>:
 a10:   ff 25 5a 15 20 00       jmpq   *0x20155a(%rip)        # 201f70 <getpid@GLIBC_2.2.5>
 a16:   68 02 00 00 00          pushq  $0x2
 a1b:   e9 c0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a20 <__stack_chk_fail@plt>:
 a20:   ff 25 52 15 20 00       jmpq   *0x201552(%rip)        # 201f78 <__stack_chk_fail@GLIBC_2.4>
 a26:   68 03 00 00 00          pushq  $0x3
 a2b:   e9 b0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a30 <PAPI_read_counters@plt>:
 a30:   ff 25 4a 15 20 00       jmpq   *0x20154a(%rip)        # 201f80 <PAPI_read_counters>
 a36:   68 04 00 00 00          pushq  $0x4
 a3b:   e9 a0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a40 <sched_setaffinity@plt>:
 a40:   ff 25 42 15 20 00       jmpq   *0x201542(%rip)        # 201f88 <sched_setaffinity@GLIBC_2.3.4>
 a46:   68 05 00 00 00          pushq  $0x5
 a4b:   e9 90 ff ff ff          jmpq   9e0 <.plt>

0000000000000a50 <PAPI_start_counters@plt>:
 a50:   ff 25 3a 15 20 00       jmpq   *0x20153a(%rip)        # 201f90 <PAPI_start_counters>
 a56:   68 06 00 00 00          pushq  $0x6
 a5b:   e9 80 ff ff ff          jmpq   9e0 <.plt>

0000000000000a60 <PAPI_stop_counters@plt>:
 a60:   ff 25 32 15 20 00       jmpq   *0x201532(%rip)        # 201f98 <PAPI_stop_counters>
 a66:   68 07 00 00 00          pushq  $0x7
 a6b:   e9 70 ff ff ff          jmpq   9e0 <.plt>

0000000000000a70 <malloc@plt>:
 a70:   ff 25 2a 15 20 00       jmpq   *0x20152a(%rip)        # 201fa0 <malloc@GLIBC_2.2.5>
 a76:   68 08 00 00 00          pushq  $0x8
 a7b:   e9 60 ff ff ff          jmpq   9e0 <.plt>

0000000000000a80 <PAPI_strerror@plt>:
 a80:   ff 25 22 15 20 00       jmpq   *0x201522(%rip)        # 201fa8 <PAPI_strerror>
 a86:   68 09 00 00 00          pushq  $0x9
 a8b:   e9 50 ff ff ff          jmpq   9e0 <.plt>

0000000000000a90 <__printf_chk@plt>:
 a90:   ff 25 1a 15 20 00       jmpq   *0x20151a(%rip)        # 201fb0 <__printf_chk@GLIBC_2.3.4>
 a96:   68 0a 00 00 00          pushq  $0xa
 a9b:   e9 40 ff ff ff          jmpq   9e0 <.plt>

0000000000000aa0 <getrusage@plt>:
 aa0:   ff 25 12 15 20 00       jmpq   *0x201512(%rip)        # 201fb8 <getrusage@GLIBC_2.2.5>
 aa6:   68 0b 00 00 00          pushq  $0xb
 aab:   e9 30 ff ff ff          jmpq   9e0 <.plt>

0000000000000ab0 <exit@plt>:
 ab0:   ff 25 0a 15 20 00       jmpq   *0x20150a(%rip)        # 201fc0 <exit@GLIBC_2.2.5>
 ab6:   68 0c 00 00 00          pushq  $0xc
 abb:   e9 20 ff ff ff          jmpq   9e0 <.plt>

0000000000000ac0 <fwrite@plt>:
 ac0:   ff 25 02 15 20 00       jmpq   *0x201502(%rip)        # 201fc8 <fwrite@GLIBC_2.2.5>
 ac6:   68 0d 00 00 00          pushq  $0xd
 acb:   e9 10 ff ff ff          jmpq   9e0 <.plt>

0000000000000ad0 <__fprintf_chk@plt>:
 ad0:   ff 25 fa 14 20 00       jmpq   *0x2014fa(%rip)        # 201fd0 <__fprintf_chk@GLIBC_2.3.4>
 ad6:   68 0e 00 00 00          pushq  $0xe
 adb:   e9 00 ff ff ff          jmpq   9e0 <.plt>

Disassembly of section .plt.got:

0000000000000ae0 <__cxa_finalize@plt>:
 ae0:   ff 25 12 15 20 00       jmpq   *0x201512(%rip)        # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
 ae6:   66 90                   xchg   %ax,%ax

Disassembly of section .text:

0000000000000af0 <main>:
     af0:   41 57                   push   %r15
     af2:   b9 0f 00 00 00          mov    $0xf,%ecx
     af7:   41 56                   push   %r14
     af9:   41 55                   push   %r13
     afb:   41 54                   push   %r12
     afd:   55                      push   %rbp
     afe:   53                      push   %rbx
     aff:   48 81 ec 78 01 00 00    sub    $0x178,%rsp
     b06:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
     b0d:   00 00 
     b0f:   48 89 84 24 68 01 00    mov    %rax,0x168(%rsp)
     b16:   00 
     b17:   31 c0                   xor    %eax,%eax
     b19:   48 8d 9c 24 e0 00 00    lea    0xe0(%rsp),%rbx
     b20:   00 
     b21:   48 b8 00 00 00 80 07    movabs $0x8000000780000000,%rax
     b28:   00 00 80 
     b2b:   48 c7 84 24 e0 00 00    movq   $0x1,0xe0(%rsp)
     b32:   00 01 00 00 00 
     b37:   48 8d 53 08             lea    0x8(%rbx),%rdx
     b3b:   48 89 84 24 c8 00 00    mov    %rax,0xc8(%rsp)
     b42:   00 
     b43:   31 c0                   xor    %eax,%eax
     b45:   48 89 d7                mov    %rdx,%rdi
     b48:   f3 48 ab                rep stos %rax,%es:(%rdi)
     b4b:   e8 c0 fe ff ff          callq  a10 <getpid@plt>
     b50:   48 89 da                mov    %rbx,%rdx
     b53:   be 80 00 00 00          mov    $0x80,%esi
     b58:   89 c7                   mov    %eax,%edi
     b5a:   e8 e1 fe ff ff          callq  a40 <sched_setaffinity@plt>
     b5f:   85 c0                   test   %eax,%eax
     b61:   0f 85 17 03 00 00       jne    e7e <main+0x38e>
     b67:   0f ae f0                mfence 
     b6a:   48 8d 74 24 10          lea    0x10(%rsp),%rsi
     b6f:   bf 02 00 00 00          mov    $0x2,%edi
     b74:   0f ae f0                mfence 
     b77:   e8 84 fe ff ff          callq  a00 <clock_gettime@plt>
     b7c:   0f 31                   rdtsc  
     b7e:   bf 00 fa 00 00          mov    $0xfa00,%edi
     b83:   0f ae f0                mfence 
     b86:   48 c1 e2 20             shl    $0x20,%rdx
     b8a:   49 89 c6                mov    %rax,%r14
     b8d:   49 09 d6                or     %rdx,%r14
     b90:   e8 db fe ff ff          callq  a70 <malloc@plt>
     b95:   48 8d bc 24 c8 00 00    lea    0xc8(%rsp),%rdi
     b9c:   00 
     b9d:   be 02 00 00 00          mov    $0x2,%esi
     ba2:   49 89 c4                mov    %rax,%r12
     ba5:   e8 a6 fe ff ff          callq  a50 <PAPI_start_counters@plt>
     baa:   85 c0                   test   %eax,%eax
     bac:   0f 85 88 02 00 00       jne    e3a <main+0x34a>
     bb2:   4d 89 e7                mov    %r12,%r15
     bb5:   49 8d 84 24 00 fa 00    lea    0xfa00(%r12),%rax
     bbc:   00 
     bbd:   4c 89 e5                mov    %r12,%rbp
     bc0:   c7 45 00 01 00 00 00    movl   $0x1,0x0(%rbp)
     bc7:   48 83 c5 40             add    $0x40,%rbp
     bcb:   48 39 e8                cmp    %rbp,%rax
     bce:   75 f0                   jne    bc0 <main+0xd0>
     bd0:   4c 8d ac 24 d0 00 00    lea    0xd0(%rsp),%r13
     bd7:   00 
     bd8:   be 02 00 00 00          mov    $0x2,%esi
     bdd:   4c 89 ef                mov    %r13,%rdi
     be0:   e8 4b fe ff ff          callq  a30 <PAPI_read_counters@plt>
     be5:   85 c0                   test   %eax,%eax
     be7:   0f 85 b8 02 00 00       jne    ea5 <main+0x3b5>
     bed:   48 8b 84 24 d0 00 00    mov    0xd0(%rsp),%rax
     bf4:   00 
     bf5:   4c 89 e3                mov    %r12,%rbx
     bf8:   48 89 44 24 08          mov    %rax,0x8(%rsp)
     bfd:   0f 1f 00                nopl   (%rax)
     c00:   83 03 09                addl   $0x9,(%rbx)
     c03:   48 83 c3 40             add    $0x40,%rbx
     c07:   48 39 dd                cmp    %rbx,%rbp
     c0a:   75 f4                   jne    c00 <main+0x110>
     c0c:   31 d2                   xor    %edx,%edx
     c0e:   48 8d 35 88 04 00 00    lea    0x488(%rip),%rsi        # 109d <_IO_stdin_used+0x2d>
     c15:   bf 01 00 00 00          mov    $0x1,%edi
     c1a:   31 c0                   xor    %eax,%eax
     c1c:   e8 6f fe ff ff          callq  a90 <__printf_chk@plt>
     c21:   be 02 00 00 00          mov    $0x2,%esi
     c26:   4c 89 ef                mov    %r13,%rdi
     c29:   e8 02 fe ff ff          callq  a30 <PAPI_read_counters@plt>
     c2e:   85 c0                   test   %eax,%eax
     c30:   0f 85 6f 02 00 00       jne    ea5 <main+0x3b5>
     c36:   48 8b 8c 24 d0 00 00    mov    0xd0(%rsp),%rcx
     c3d:   00 
     c3e:   48 8b 54 24 08          mov    0x8(%rsp),%rdx
     c43:   48 8d 35 e6 04 00 00    lea    0x4e6(%rip),%rsi        # 1130 <_IO_stdin_used+0xc0>
     c4a:   31 c0                   xor    %eax,%eax
     c4c:   bf 01 00 00 00          mov    $0x1,%edi
     c51:   e8 3a fe ff ff          callq  a90 <__printf_chk@plt>
     c56:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     c5d:   00 00 00 
     c60:   41 0f ae 3c 24          clflush (%r12)
     c65:   49 83 c4 40             add    $0x40,%r12
     c69:   49 39 dc                cmp    %rbx,%r12
     c6c:   75 f2                   jne    c60 <main+0x170>
     c6e:   be 02 00 00 00          mov    $0x2,%esi
     c73:   4c 89 ef                mov    %r13,%rdi
     c76:   e8 b5 fd ff ff          callq  a30 <PAPI_read_counters@plt>
     c7b:   85 c0                   test   %eax,%eax
     c7d:   0f 85 22 02 00 00       jne    ea5 <main+0x3b5>
     c83:   48 8b ac 24 d0 00 00    mov    0xd0(%rsp),%rbp
     c8a:   00 
     c8b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
     c90:   41 83 07 09             addl   $0x9,(%r15)
     c94:   49 83 c7 40             add    $0x40,%r15
     c98:   49 39 df                cmp    %rbx,%r15
     c9b:   75 f3                   jne    c90 <main+0x1a0>
     c9d:   be 02 00 00 00          mov    $0x2,%esi
     ca2:   4c 89 ef                mov    %r13,%rdi
     ca5:   e8 86 fd ff ff          callq  a30 <PAPI_read_counters@plt>
     caa:   85 c0                   test   %eax,%eax
     cac:   0f 85 f3 01 00 00       jne    ea5 <main+0x3b5>
     cb2:   48 8b 8c 24 d0 00 00    mov    0xd0(%rsp),%rcx
     cb9:   00 
     cba:   48 8d 35 97 04 00 00    lea    0x497(%rip),%rsi        # 1158 <_IO_stdin_used+0xe8>
     cc1:   bf 01 00 00 00          mov    $0x1,%edi
     cc6:   31 c0                   xor    %eax,%eax
     cc8:   48 89 ea                mov    %rbp,%rdx
     ccb:   e8 c0 fd ff ff          callq  a90 <__printf_chk@plt>
     cd0:   be 02 00 00 00          mov    $0x2,%esi
     cd5:   4c 89 ef                mov    %r13,%rdi
     cd8:   e8 83 fd ff ff          callq  a60 <PAPI_stop_counters@plt>
     cdd:   85 c0                   test   %eax,%eax
     cdf:   0f 85 72 01 00 00       jne    e57 <main+0x367>
     ce5:   0f ae f0                mfence 
     ce8:   0f 31                   rdtsc  
     cea:   bf 02 00 00 00          mov    $0x2,%edi
     cef:   48 c1 e2 20             shl    $0x20,%rdx
     cf3:   48 89 c3                mov    %rax,%rbx
     cf6:   48 8d 74 24 20          lea    0x20(%rsp),%rsi
     cfb:   48 09 d3                or     %rdx,%rbx
     cfe:   e8 fd fc ff ff          callq  a00 <clock_gettime@plt>
     d03:   bf 01 00 00 00          mov    $0x1,%edi
     d08:   48 be db 34 b6 d7 82    movabs $0x431bde82d7b634db,%rsi
     d0f:   de 1b 43 
     d12:   0f ae f0                mfence 
     d15:   48 8b 4c 24 20          mov    0x20(%rsp),%rcx
     d1a:   48 2b 4c 24 10          sub    0x10(%rsp),%rcx
     d1f:   48 69 c9 00 ca 9a 3b    imul   $0x3b9aca00,%rcx,%rcx
     d26:   48 03 4c 24 28          add    0x28(%rsp),%rcx
     d2b:   48 2b 4c 24 18          sub    0x18(%rsp),%rcx
     d30:   48 89 c8                mov    %rcx,%rax
     d33:   48 c1 f9 3f             sar    $0x3f,%rcx
     d37:   48 f7 ee                imul   %rsi
     d3a:   48 8d 35 3f 04 00 00    lea    0x43f(%rip),%rsi        # 1180 <_IO_stdin_used+0x110>
     d41:   31 c0                   xor    %eax,%eax
     d43:   48 c1 fa 12             sar    $0x12,%rdx
     d47:   48 29 ca                sub    %rcx,%rdx
     d4a:   e8 41 fd ff ff          callq  a90 <__printf_chk@plt>
     d4f:   48 89 da                mov    %rbx,%rdx
     d52:   bf 01 00 00 00          mov    $0x1,%edi
     d57:   31 c0                   xor    %eax,%eax
     d59:   4c 29 f2                sub    %r14,%rdx
     d5c:   48 8d 35 53 03 00 00    lea    0x353(%rip),%rsi        # 10b6 <_IO_stdin_used+0x46>
     d63:   e8 28 fd ff ff          callq  a90 <__printf_chk@plt>
     d68:   31 d2                   xor    %edx,%edx
     d6a:   48 8d 35 56 03 00 00    lea    0x356(%rip),%rsi        # 10c7 <_IO_stdin_used+0x57>
     d71:   31 c0                   xor    %eax,%eax
     d73:   bf 01 00 00 00          mov    $0x1,%edi
     d78:   e8 13 fd ff ff          callq  a90 <__printf_chk@plt>
     d7d:   31 ff                   xor    %edi,%edi
     d7f:   48 8d 74 24 30          lea    0x30(%rsp),%rsi
     d84:   e8 17 fd ff ff          callq  aa0 <getrusage@plt>
     d89:   83 f8 ff                cmp    $0xffffffff,%eax
     d8c:   0f 84 d6 00 00 00       je     e68 <main+0x378>
     d92:   48 8b 8c 24 b8 00 00    mov    0xb8(%rsp),%rcx
     d99:   00 
     d9a:   48 8b 94 24 b0 00 00    mov    0xb0(%rsp),%rdx
     da1:   00 
     da2:   48 8d 35 3e 03 00 00    lea    0x33e(%rip),%rsi        # 10e7 <_IO_stdin_used+0x77>
     da9:   31 c0                   xor    %eax,%eax
     dab:   bf 01 00 00 00          mov    $0x1,%edi
     db0:   e8 db fc ff ff          callq  a90 <__printf_chk@plt>
     db5:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
     db9:   bf 01 00 00 00          mov    $0x1,%edi
     dbe:   c5 fb 10 0d 12 04 00    vmovsd 0x412(%rip),%xmm1        # 11d8 <_IO_stdin_used+0x168>
     dc5:   00 
     dc6:   48 69 44 24 30 40 42    imul   $0xf4240,0x30(%rsp),%rax
     dcd:   0f 00 
     dcf:   48 03 44 24 38          add    0x38(%rsp),%rax
     dd4:   48 8d 35 d5 03 00 00    lea    0x3d5(%rip),%rsi        # 11b0 <_IO_stdin_used+0x140>
     ddb:   c4 e1 fb 2a c0          vcvtsi2sd %rax,%xmm0,%xmm0
     de0:   48 69 54 24 40 40 42    imul   $0xf4240,0x40(%rsp),%rdx
     de7:   0f 00 
     de9:   48 03 54 24 48          add    0x48(%rsp),%rdx
     dee:   c5 fb 59 c1             vmulsd %xmm1,%xmm0,%xmm0
     df2:   c4 e1 fb 2c c0          vcvttsd2si %xmm0,%rax
     df7:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
     dfb:   c4 e1 fb 2a c2          vcvtsi2sd %rdx,%xmm0,%xmm0
     e00:   c5 fb 59 c1             vmulsd %xmm1,%xmm0,%xmm0
     e04:   c4 e1 fb 2c d0          vcvttsd2si %xmm0,%rdx
     e09:   48 01 c2                add    %rax,%rdx
     e0c:   31 c0                   xor    %eax,%eax
     e0e:   e8 7d fc ff ff          callq  a90 <__printf_chk@plt>
     e13:   31 c0                   xor    %eax,%eax
     e15:   48 8b 8c 24 68 01 00    mov    0x168(%rsp),%rcx
     e1c:   00 
     e1d:   64 48 33 0c 25 28 00    xor    %fs:0x28,%rcx
     e24:   00 00 
     e26:   75 51                   jne    e79 <main+0x389>
     e28:   48 81 c4 78 01 00 00    add    $0x178,%rsp
     e2f:   5b                      pop    %rbx
     e30:   5d                      pop    %rbp
     e31:   41 5c                   pop    %r12
     e33:   41 5d                   pop    %r13
     e35:   41 5e                   pop    %r14
     e37:   41 5f                   pop    %r15
     e39:   c3                      retq   
     e3a:   ba 01 00 00 00          mov    $0x1,%edx
     e3f:   48 8d 35 47 02 00 00    lea    0x247(%rip),%rsi        # 108d <_IO_stdin_used+0x1d>
     e46:   bf 01 00 00 00          mov    $0x1,%edi
     e4b:   31 c0                   xor    %eax,%eax
     e4d:   e8 3e fc ff ff          callq  a90 <__printf_chk@plt>
     e52:   e9 5b fd ff ff          jmpq   bb2 <main+0xc2>
     e57:   48 8d 3d 4a 02 00 00    lea    0x24a(%rip),%rdi        # 10a8 <_IO_stdin_used+0x38>
     e5e:   e8 8d fb ff ff          callq  9f0 <puts@plt>
     e63:   e9 7d fe ff ff          jmpq   ce5 <main+0x1f5>
     e68:   48 8d 3d 62 02 00 00    lea    0x262(%rip),%rdi        # 10d1 <_IO_stdin_used+0x61>
     e6f:   e8 7c fb ff ff          callq  9f0 <puts@plt>
     e74:   e9 19 ff ff ff          jmpq   d92 <main+0x2a2>
     e79:   e8 a2 fb ff ff          callq  a20 <__stack_chk_fail@plt>
     e7e:   48 8b 0d 9b 11 20 00    mov    0x20119b(%rip),%rcx        # 202020 <stderr@@GLIBC_2.2.5>
     e85:   ba 18 00 00 00          mov    $0x18,%edx
     e8a:   be 01 00 00 00          mov    $0x1,%esi
     e8f:   48 8d 3d de 01 00 00    lea    0x1de(%rip),%rdi        # 1074 <_IO_stdin_used+0x4>
     e96:   e8 25 fc ff ff          callq  ac0 <fwrite@plt>
     e9b:   bf 01 00 00 00          mov    $0x1,%edi
     ea0:   e8 0b fc ff ff          callq  ab0 <exit@plt>
     ea5:   89 c7                   mov    %eax,%edi
     ea7:   e8 d4 fb ff ff          callq  a80 <PAPI_strerror@plt>
     eac:   48 8b 3d 6d 11 20 00    mov    0x20116d(%rip),%rdi        # 202020 <stderr@@GLIBC_2.2.5>
     eb3:   be 01 00 00 00          mov    $0x1,%esi
     eb8:   48 8d 15 49 02 00 00    lea    0x249(%rip),%rdx        # 1108 <_IO_stdin_used+0x98>
     ebf:   48 89 c1                mov    %rax,%rcx
     ec2:   31 c0                   xor    %eax,%eax
     ec4:   e8 07 fc ff ff          callq  ad0 <__fprintf_chk@plt>
     ec9:   bf 01 00 00 00          mov    $0x1,%edi
     ece:   e8 dd fb ff ff          callq  ab0 <exit@plt>
     ed3:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     eda:   00 00 00 
     edd:   0f 1f 00                nopl   (%rax)

0000000000000ee0 <_start>:
     ee0:   31 ed                   xor    %ebp,%ebp
     ee2:   49 89 d1                mov    %rdx,%r9
     ee5:   5e                      pop    %rsi
     ee6:   48 89 e2                mov    %rsp,%rdx
     ee9:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
     eed:   50                      push   %rax
     eee:   54                      push   %rsp
     eef:   4c 8d 05 6a 01 00 00    lea    0x16a(%rip),%r8        # 1060 <__libc_csu_fini>
     ef6:   48 8d 0d f3 00 00 00    lea    0xf3(%rip),%rcx        # ff0 <__libc_csu_init>
     efd:   48 8d 3d ec fb ff ff    lea    -0x414(%rip),%rdi        # af0 <main>
     f04:   ff 15 d6 10 20 00       callq  *0x2010d6(%rip)        # 201fe0 <__libc_start_main@GLIBC_2.2.5>
     f0a:   f4                      hlt    
     f0b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000000f10 <deregister_tm_clones>:
     f10:   48 8d 3d f9 10 20 00    lea    0x2010f9(%rip),%rdi        # 202010 <__TMC_END__>
     f17:   55                      push   %rbp
     f18:   48 8d 05 f1 10 20 00    lea    0x2010f1(%rip),%rax        # 202010 <__TMC_END__>
     f1f:   48 39 f8                cmp    %rdi,%rax
     f22:   48 89 e5                mov    %rsp,%rbp
     f25:   74 19                   je     f40 <deregister_tm_clones+0x30>
     f27:   48 8b 05 aa 10 20 00    mov    0x2010aa(%rip),%rax        # 201fd8 <_ITM_deregisterTMCloneTable>
     f2e:   48 85 c0                test   %rax,%rax
     f31:   74 0d                   je     f40 <deregister_tm_clones+0x30>
     f33:   5d                      pop    %rbp
     f34:   ff e0                   jmpq   *%rax
     f36:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     f3d:   00 00 00 
     f40:   5d                      pop    %rbp
     f41:   c3                      retq   
     f42:   0f 1f 40 00             nopl   0x0(%rax)
     f46:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     f4d:   00 00 00 

0000000000000f50 <register_tm_clones>:
     f50:   48 8d 3d b9 10 20 00    lea    0x2010b9(%rip),%rdi        # 202010 <__TMC_END__>
     f57:   48 8d 35 b2 10 20 00    lea    0x2010b2(%rip),%rsi        # 202010 <__TMC_END__>
     f5e:   55                      push   %rbp
     f5f:   48 29 fe                sub    %rdi,%rsi
     f62:   48 89 e5                mov    %rsp,%rbp
     f65:   48 c1 fe 03             sar    $0x3,%rsi
     f69:   48 89 f0                mov    %rsi,%rax
     f6c:   48 c1 e8 3f             shr    $0x3f,%rax
     f70:   48 01 c6                add    %rax,%rsi
     f73:   48 d1 fe                sar    %rsi
     f76:   74 18                   je     f90 <register_tm_clones+0x40>
     f78:   48 8b 05 71 10 20 00    mov    0x201071(%rip),%rax        # 201ff0 <_ITM_registerTMCloneTable>
     f7f:   48 85 c0                test   %rax,%rax
     f82:   74 0c                   je     f90 <register_tm_clones+0x40>
     f84:   5d                      pop    %rbp
     f85:   ff e0                   jmpq   *%rax
     f87:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
     f8e:   00 00 
     f90:   5d                      pop    %rbp
     f91:   c3                      retq   
     f92:   0f 1f 40 00             nopl   0x0(%rax)
     f96:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     f9d:   00 00 00 

0000000000000fa0 <__do_global_dtors_aux>:
     fa0:   80 3d 81 10 20 00 00    cmpb   $0x0,0x201081(%rip)        # 202028 <completed.7696>
     fa7:   75 2f                   jne    fd8 <__do_global_dtors_aux+0x38>
     fa9:   48 83 3d 47 10 20 00    cmpq   $0x0,0x201047(%rip)        # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
     fb0:   00 
     fb1:   55                      push   %rbp
     fb2:   48 89 e5                mov    %rsp,%rbp
     fb5:   74 0c                   je     fc3 <__do_global_dtors_aux+0x23>
     fb7:   48 8b 3d 4a 10 20 00    mov    0x20104a(%rip),%rdi        # 202008 <__dso_handle>
     fbe:   e8 1d fb ff ff          callq  ae0 <__cxa_finalize@plt>
     fc3:   e8 48 ff ff ff          callq  f10 <deregister_tm_clones>
     fc8:   c6 05 59 10 20 00 01    movb   $0x1,0x201059(%rip)        # 202028 <completed.7696>
     fcf:   5d                      pop    %rbp
     fd0:   c3                      retq   
     fd1:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
     fd8:   f3 c3                   repz retq 
     fda:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000fe0 <frame_dummy>:
     fe0:   55                      push   %rbp
     fe1:   48 89 e5                mov    %rsp,%rbp
     fe4:   5d                      pop    %rbp
     fe5:   e9 66 ff ff ff          jmpq   f50 <register_tm_clones>
     fea:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000ff0 <__libc_csu_init>:
     ff0:   41 57                   push   %r15
     ff2:   41 56                   push   %r14
     ff4:   49 89 d7                mov    %rdx,%r15
     ff7:   41 55                   push   %r13
     ff9:   41 54                   push   %r12
     ffb:   4c 8d 25 36 0d 20 00    lea    0x200d36(%rip),%r12        # 201d38 <__frame_dummy_init_array_entry>
    1002:   55                      push   %rbp
    1003:   48 8d 2d 36 0d 20 00    lea    0x200d36(%rip),%rbp        # 201d40 <__init_array_end>
    100a:   53                      push   %rbx
    100b:   41 89 fd                mov    %edi,%r13d
    100e:   49 89 f6                mov    %rsi,%r14
    1011:   4c 29 e5                sub    %r12,%rbp
    1014:   48 83 ec 08             sub    $0x8,%rsp
    1018:   48 c1 fd 03             sar    $0x3,%rbp
    101c:   e8 9f f9 ff ff          callq  9c0 <_init>
    1021:   48 85 ed                test   %rbp,%rbp
    1024:   74 20                   je     1046 <__libc_csu_init+0x56>
    1026:   31 db                   xor    %ebx,%ebx
    1028:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
    102f:   00 
    1030:   4c 89 fa                mov    %r15,%rdx
    1033:   4c 89 f6                mov    %r14,%rsi
    1036:   44 89 ef                mov    %r13d,%edi
    1039:   41 ff 14 dc             callq  *(%r12,%rbx,8)
    103d:   48 83 c3 01             add    $0x1,%rbx
    1041:   48 39 dd                cmp    %rbx,%rbp
    1044:   75 ea                   jne    1030 <__libc_csu_init+0x40>
    1046:   48 83 c4 08             add    $0x8,%rsp
    104a:   5b                      pop    %rbx
    104b:   5d                      pop    %rbp
    104c:   41 5c                   pop    %r12
    104e:   41 5d                   pop    %r13
    1050:   41 5e                   pop    %r14
    1052:   41 5f                   pop    %r15
    1054:   c3                      retq   
    1055:   90                      nop
    1056:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    105d:   00 00 00 

0000000000001060 <__libc_csu_fini>:
    1060:   f3 c3                   repz retq 

Disassembly of section .fini:

0000000000001064 <_fini>:
    1064:   48 83 ec 08             sub    $0x8,%rsp
    1068:   48 83 c4 08             add    $0x8,%rsp
    106c:   c3                      retq 

I have an object size of 64, and I added initialization as well:

typedef struct _object{
  int value;
  int pad_0;
  int * pad_2;
  int * pad_3;
  int * pad_4;
  int * pad_5;
  int * pad_6;
  int * pad_7;
  int * pad_8;
} object;  

object * array;
int arr_size = 1000;
array = (object *) malloc(arr_size * sizeof(object));
for(int i=0; i < arr_size; i++){
      array[i].value = 1;
    }
c
caching
x86
perf
papi
asked on Stack Overflow Feb 11, 2019 by Ana Khorguani • edited Feb 13, 2019 by Ana Khorguani

1 Answer

3

I've done some experiments using LIKWID, which is similar to PAPI, on Haswell. I found out that the calls to the functions that initialize and read the performance counters can cause more than 600 replacements in the L1 cache. Since the L1 cache has only 512 lines, this means that these functions may evict many of the lines that you would otherwise expect to be in the L1. By looking at the relatively large source code of PAPI_start_counters and _internal_hl_read_cnts, it seems to me that these functions may evict many lines from the L1, so the array elements don't survive in the L1 across these calls. I've verified this by using loads instead of stores and counting hits and misses using MEM_LOAD_RETIRED.*. I think the solution would be to use the RDPMC instruction. I have not used this instruction directly before. The code snippets here look useful.

Alternatively, you can put two copies of the loop after PAPI_start_counters/PAPI_read_counters and then subtract from the results the counts for one copy of the loop. This method works well.

By the way, the L1D.REPLACEMENT counter seems to be fairly accurate on Haswell when the number of cache lines accessed is about larger than 10. Perhaps the count would be exact by using RDPMC.


From your previous question, it seems that you're on Skylake. According to the PAPI event mapping, PAPI_L1_DCM and PAPI_L2_TCM are mapped to L1D.REPLACEMENT and LONGEST_LAT_CACHE.REFERENCE performance monitoring events on Intel processors. These are defined in the Intel manual as follows:

L1D.REPLACEMENT: Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.

LONGEST_LAT_CACHE.REFERENCE: This event counts core-originated cacheable demand requests that refer to the last level cache (LLC). Demand requests include loads, RFOs, and hardware prefetches from L1D, and instruction fetches from IFU.

Without getting into the details of when these events exactly occur, there are three important points here that are relevant to your question:

  • Both events are counted at the cache-line granularity, not x86 instruction or load uop granularities.
  • These events may occur due to the L1D hardware prefetchers. This can impact miss2.
  • There is no way to count L1D hits at the cache line granularity for a specific physical or logical core using these events (or any other set of events on SnB-based micoarchitectures).

On Skylake, there are other native events that you can use to count L1D misses and hits per load instruction. You can use MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that hit in the L1D. You can use MEM_INST_RETIRED.ALL_LOADS-MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that miss in the L1D. There doesn't seem to be PAPI events for them. According to the documentation, you can pass native event codes to PAPIF_start_counters.

Another issue is that it's not clear to me whether PAPIF_start_counters by default will count only user events of both kernel and user events. It seems that you can use PAPI_create_eventset to control the counting domain.

The calls to PAPI APIs can also impact the event counts. You can try to measure this using an empty block as follows:

if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
   fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
   exit(1);
}

// Nothing.

if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
    exit(1);
}

This measurement will give you an estimate of the error that may occur due to PAPI itself.

Also, I don't think you need to use _mm_mfence.

answered on Stack Overflow Feb 12, 2019 by Hadi Brais • edited Feb 25, 2019 by Hadi Brais

User contributions licensed under CC BY-SA 3.0