Inconsistent clock cycle counts from measurement tools: where exactly do the inconsistencies come from?


I have spent a lot of time trying to measure the exact number of clock cycles taken by given instructions, in a portion of code written in C. However, I have never been able to measure exactly how many cycles it takes at runtime. I have used PAPI, the "rdtsc" hardware counter, clock(), and clock_gettime with "CLOCK_REALTIME", but I get different numbers on every run. (I don't want to iterate the program 100000 times and take the average; that does not work in my case, and even if it did, I don't think an average is a precise measure.)

I'm surprised that there is no precise measurement tool. Since we know exactly how many instructions there are, how is it possible that we don't have a proper tool to measure this?

I profile with oprofile and VTune (which uses perf internally), so I know exactly what is going on, but how can I measure it exactly? I'm afraid there is no way!

For the sample code below, which calls a Fibonacci function, there seems to be no way to get the exact number of cycles!

#include <stdio.h>
#include <time.h>
#include <math.h>
#include <stdint.h>

#ifdef PAPI
#include <papi.h>
#endif

uint64_t fib(int n) { return n < 2 ? (uint64_t)n : fib(n-1) + fib(n-2); }

/* Read the x86 time-stamp counter into val (x86/x86_64 only). */
#define rdtscll(val) {                                                   \
    unsigned int __a,__d;                                                \
    asm volatile("rdtsc" : "=a" (__a), "=d" (__d));                      \
    (val) = ((unsigned long long)__a) | (((unsigned long long)__d)<<32); \
}

int main(int argc, char **argv)
{
    struct timespec tv;
    long long start = 0, stop = 0;
    volatile uint64_t sink = 0;  /* keeps the fib() result live so the call is not optimized away */
#ifdef PAPI
    //  gcc test.c -I/${PAPI_DIR}/include -L/${PAPI_DIR}/lib -O3 -o test -lpapi
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    start = PAPI_get_real_cyc();
    #ifdef math
    sink = fib(9);  /* call reconstructed; 9 matches the constant loaded in the disassembly below */
    #endif
    stop = PAPI_get_real_cyc();
    printf("total cycles : %lld\n",stop-start);
#endif
#ifdef clk
    start = clock();
    #ifdef math
    sink = fib(9);
    #endif
    stop = clock();
    /* clock() returns processor-time ticks (CLOCKS_PER_SEC per second), not CPU cycles */
    printf("total clock ticks : %lld\n",stop-start);
#endif
#ifdef Wtime
    clock_gettime(CLOCK_REALTIME, &tv);
    start= (tv.tv_sec) * 1000000000LL + (tv.tv_nsec);
    #ifdef math
    sink = fib(9);
    #endif
    clock_gettime(CLOCK_REALTIME, &tv);
    stop= (tv.tv_sec) * 1000000000LL + (tv.tv_nsec);
    printf("total time : %lld\n",stop-start);
#endif
#ifdef hardware_counter
    rdtscll(start);
    #ifdef math
    sink = fib(9);
    #endif
    rdtscll(stop);
    printf("total cycles : %lld\n",stop-start);
#endif
    (void)sink;
    return 0;
}

Here is the objdump output for one of the variants, the rdtsc one:

0000000000400450 <main>:
  400450:   48 83 ec 08             sub    $0x8,%rsp
  400454:   0f 31                   rdtsc  
  400456:   89 c0                   mov    %eax,%eax
  400458:   48 c1 e2 20             shl    $0x20,%rdx
  40045c:   48 09 c2                or     %rax,%rdx
  40045f:   b8 09 00 00 00          mov    $0x9,%eax
  400464:   49 89 d3                mov    %rdx,%r11
  400467:   83 f8 01                cmp    $0x1,%eax
  40046a:   7e 5c                   jle    4004c8 <main+0x78>
  40046c:   44 8d 48 fe             lea    -0x2(%rax),%r9d
  400470:   44 8d 50 fd             lea    -0x3(%rax),%r10d
  400474:   44 8d 40 ff             lea    -0x1(%rax),%r8d
  400478:   44 89 c8                mov    %r9d,%eax
  40047b:   83 e0 fe                and    $0xfffffffe,%eax
  40047e:   41 29 c2                sub    %eax,%r10d
  400481:   44 89 c7                mov    %r8d,%edi
  400484:   e8 37 01 00 00          callq  4005c0 <xfib>
  400489:   41 83 e8 02             sub    $0x2,%r8d
  40048d:   45 39 d0                cmp    %r10d,%r8d
  400490:   75 ef                   jne    400481 <main+0x31>
  400492:   41 83 f9 ff             cmp    $0xffffffff,%r9d
  400496:   44 89 c8                mov    %r9d,%eax
  400499:   75 cc                   jne    400467 <main+0x17>
  40049b:   0f 31                   rdtsc  
  40049d:   89 c0                   mov    %eax,%eax
  40049f:   48 c1 e2 20             shl    $0x20,%rdx
  4004a3:   be 24 0a 40 00          mov    $0x400a24,%esi
  4004a8:   48 09 c2                or     %rax,%rdx
  4004ab:   bf 01 00 00 00          mov    $0x1,%edi
  4004b0:   31 c0                   xor    %eax,%eax
  4004b2:   4c 29 da                sub    %r11,%rdx
  4004b5:   e8 76 ff ff ff          callq  400430 <__printf_chk@plt>
  4004ba:   31 c0                   xor    %eax,%eax
  4004bc:   48 83 c4 08             add    $0x8,%rsp
  4004c0:   c3                      retq   
  4004c1:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4004c8:   44 8d 48 fe             lea    -0x2(%rax),%r9d
  4004cc:   41 83 f9 ff             cmp    $0xffffffff,%r9d
  4004d0:   44 89 c8                mov    %r9d,%eax
  4004d3:   75 92                   jne    400467 <main+0x17>
  4004d5:   eb c4                   jmp    40049b <main+0x4b>
  4004d7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  4004de:   00 00 

How can I measure, exactly and consistently, how many clock cycles a computer spends on a portion of code that makes no calls to other libraries, stdout, etc.?

I appreciate any help.

EDIT: Thanks all for your comments. Regarding the setup, I use x86_64 and aarch64 physical machines running Ubuntu 16.04 and Ubuntu 18.04. On aarch64, PAPI could not pass its validation tests, so I skipped PAPI there. I also tried "perf", but got the same results: perf cycle counter ->

I think this comes down to the fundamentals of computer architecture: on von Neumann machines, "random memory access", "branch prediction", and similar mechanisms are the reason we cannot predict the cycle count or get the same count on every run of our program. If I run this program using HLS on an FPGA device, I can measure exactly, cycle by cycle, how many cycles the device uses, whereas on conventional von Neumann-based machines this is not possible.

asked on Stack Overflow Apr 17, 2021 by AminS • edited Apr 18, 2021 by AminS

0 Answers

Nobody has answered this question yet.

User contributions licensed under CC BY-SA 3.0