I spent many time to measure exact clock cycles of given instructions, a portion of code written in C. However, I never could measure exactly how many cycles will take during the runtime, I used PAPI, "rtdsc" hardware counter, clock(), clock_gettime with "CLOCK_REALTIME". However, for every run, I get different numbers, (I don't want to iterate the program over 100000 times and get the average, is not working in my case! even works, the average still is not the precise way in my opinion),
I'm surprised how there is not a precise measurement tool, since we know exactly how many instructions are there, how is possible that we don't have a proper tool to measure!
I profile with oprofile and vtune(which inside uses pref), I know exactly what is going on but how I can measure exactly it? I'm afraid there is no way!
For the sample code below, a call to the Fibonacci function, no way to get the exact number of cycles!
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <stdint.h>
#ifdef PAPI
#include <papi.h>
#endif
uint64_t fib(int n) { return n < 2 ? (uint64_t)n : fib(n-1) + fib(n-2); }
#define rdtscll(val) { \
unsigned int __a,__d; \
asm volatile("rdtsc" : "=a" (__a), "=d" (__d)); \
(val) = ((unsigned long)__a) | (((unsigned long)__d)<<32); \
}
int main(int argc, char **argv)
{
struct timespec tv;
long long start,stop;
#ifdef PAPI
// gcc test.c -I/${PAPI_DIR}/include -L/${PAPI_DIR}/lib -O3 -o test -lpapi
if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
exit(1);
start = PAPI_get_real_cyc();
#ifdef math
fib(10);
#endif
stop = PAPI_get_real_cyc();
printf("total cycles : %lld\n",stop-start);
#endif
#ifdef clk
start = clock();
#ifdef math
fib(10);
#endif
stop = clock();
printf("total cycles : %lld\n",stop-start);
#endif
#ifdef Wtime
clock_gettime(CLOCK_REALTIME, &tv);
start= (tv.tv_sec) * 1000000000 + (tv.tv_nsec);
#ifdef math
fib(10);
#endif
clock_gettime(CLOCK_REALTIME, &tv);
stop= (tv.tv_sec) * 1000000000 + (tv.tv_nsec);
printf("total time : %lld\n",stop-start);
#endif
#ifdef hardware_counter
rdtscll(start);
#ifdef math
fib(10);
#endif
rdtscll(stop);
printf("total cycles : %lld\n",stop-start);
#endif
return 0;
}
I put here the objdump output for some of them, for instance rtdsc:
0000000000400450 <main>:
400450: 48 83 ec 08 sub $0x8,%rsp
400454: 0f 31 rdtsc
400456: 89 c0 mov %eax,%eax
400458: 48 c1 e2 20 shl $0x20,%rdx
40045c: 48 09 c2 or %rax,%rdx
40045f: b8 09 00 00 00 mov $0x9,%eax
400464: 49 89 d3 mov %rdx,%r11
400467: 83 f8 01 cmp $0x1,%eax
40046a: 7e 5c jle 4004c8 <main+0x78>
40046c: 44 8d 48 fe lea -0x2(%rax),%r9d
400470: 44 8d 50 fd lea -0x3(%rax),%r10d
400474: 44 8d 40 ff lea -0x1(%rax),%r8d
400478: 44 89 c8 mov %r9d,%eax
40047b: 83 e0 fe and $0xfffffffe,%eax
40047e: 41 29 c2 sub %eax,%r10d
400481: 44 89 c7 mov %r8d,%edi
400484: e8 37 01 00 00 callq 4005c0 <xfib>
400489: 41 83 e8 02 sub $0x2,%r8d
40048d: 45 39 d0 cmp %r10d,%r8d
400490: 75 ef jne 400481 <main+0x31>
400492: 41 83 f9 ff cmp $0xffffffff,%r9d
400496: 44 89 c8 mov %r9d,%eax
400499: 75 cc jne 400467 <main+0x17>
40049b: 0f 31 rdtsc
40049d: 89 c0 mov %eax,%eax
40049f: 48 c1 e2 20 shl $0x20,%rdx
4004a3: be 24 0a 40 00 mov $0x400a24,%esi
4004a8: 48 09 c2 or %rax,%rdx
4004ab: bf 01 00 00 00 mov $0x1,%edi
4004b0: 31 c0 xor %eax,%eax
4004b2: 4c 29 da sub %r11,%rdx
4004b5: e8 76 ff ff ff callq 400430 <__printf_chk@plt>
4004ba: 31 c0 xor %eax,%eax
4004bc: 48 83 c4 08 add $0x8,%rsp
4004c0: c3 retq
4004c1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
4004c8: 44 8d 48 fe lea -0x2(%rax),%r9d
4004cc: 41 83 f9 ff cmp $0xffffffff,%r9d
4004d0: 44 89 c8 mov %r9d,%eax
4004d3: 75 92 jne 400467 <main+0x17>
4004d5: eb c4 jmp 40049b <main+0x4b>
4004d7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
4004de: 00 00
What is the way to measure exactly and consistently how many clock cycles a computer uses between a portion of code with no calls to another library, stdout, etc,?
I appreciate any help,
EDIT: Thanks all for your comments, regarding the framework, I use x86_64 and aarch64 on physical machines running ubuntu 16.04 and ubuntu 18.04. In PAPI case on aarch64 could not passed the validation tests so I skipped PAPI. However, I tried also this topic using "perf" but the same results, perf cycle counter -> https://stackoverflow.com/a/64898073/3101659
I think this is related to the fundamental of computer architecture, in von Neuman machines, using "random memory access", "brach prediction" and other stuff is the reason that we can not predict and achieve the same cycles in our program. If I run this program using HLS on FPGA device I can measure exactly how many cycles uses by the device cycle by cycle, wheras, on conventioal machines like von Neuman based machines, this is not possible.
User contributions licensed under CC BY-SA 3.0