How to measure elapsed time on ARM Cortex-M4 processor in C?


I'm using an STM32F429 with an ARM Cortex-M4 processor. I should say up front that I don't know ARM assembly, but I need to optimize my code. I read the solution to

How to measure program execution time in ARM Cortex-A8 processor?

which is what I need, but that solution is for the Cortex-A8. On a whim, I tried to use the code from the link above in my own code, but I get a SEGV at this point:

  if (enable_divider)
    value |= 8;     // enable "by 64" divider for CCNT.

  value |= 16;

  // program the performance-counter control-register:
  asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));  /* <--- here I get the SEGV error */

  // enable all counters:
  asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));

  // clear overflows:
  asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));

How can I adapt this assembly code to run on an ARM Cortex-M4?

c
optimization
assembly
time
arm
asked on Stack Overflow Nov 23, 2014 by Anth • edited May 23, 2017 by Community

1 Answer


Ditch the Cortex-A8 method.

This is the correct way to do it for most Cortex-M based microcontrollers (do not use SysTick!):

  1. Set up a timer that runs at the same clock speed as the CPU.
  2. Do not attach an interrupt to the timer.
  3. Read the timer value with a single LDR instruction just before the code you want to measure.
  4. Execute a NOP instruction, then run the code you want to measure.
  5. Execute a NOP instruction, then read the timer value again with a single LDR instruction to end the measurement.
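
The steps above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names `ticks_begin` and `ticks_end` are hypothetical, and the code assumes a free-running 32-bit up-counter that has already been configured, whose count register address is passed in as `counter`. It also assumes a GCC- or Clang-style compiler for the inline `nop`, and that the compiler emits a single load for each volatile read, which is worth confirming in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reads the counter once, then executes a NOP
 * (steps 3 and 4). The pointer check runs before the first read, so it
 * is not part of the measured region. */
static inline uint32_t ticks_begin(volatile const uint32_t *counter)
{
    assert(counter != NULL);

    uint32_t start = *counter;     /* one volatile 32-bit read of the timer */
    __asm__ volatile ("nop");      /* NOP before the code under test */
    return start;
}

/* Hypothetical helper: executes a NOP, reads the counter again, and
 * returns the elapsed ticks (step 5). No checks here, so that nothing
 * extra runs between the measured code and the second read. */
static inline uint32_t ticks_end(volatile const uint32_t *counter,
                                 uint32_t start)
{
    __asm__ volatile ("nop");      /* NOP after the code under test */
    uint32_t end = *counter;       /* second volatile read of the timer */

    /* Unsigned subtraction stays correct across one counter wraparound. */
    return end - start;
}
```

The code to be measured goes between the two calls. On an STM32F429 one might pass `&TIM2->CNT` from the CMSIS device header, assuming TIM2 has been set up as the timer from step 1; that choice of timer is an assumption, not something this answer specifies.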

The NOP instructions are there for accuracy: they make sure that pipelining does not disturb your results. This is necessary on the Cortex-M3 because a single LDR instruction takes two clock cycles, but two contiguous LDR instructions can be pipelined, so together they take only 3 clock cycles. See the Cortex-M4 Technical Reference Manual at the ARM Information Center for more information on instruction set timing.
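
To see how many ticks the two reads and the two NOPs contribute on a given device, one option is to time an empty region once and subtract that baseline from later results. This is an addition for illustration, not part of the method described above. It is a minimal sketch: the names `empty_region_ticks` and `net_ticks` are hypothetical, `counter` is assumed to point at an already-running 32-bit up-counter, and the inline `nop` assumes a GCC- or Clang-style compiler.

```c
#include <stdint.h>

/* Hypothetical helper: measures a region that contains only the two
 * NOPs, i.e. the cost of the measurement scaffolding itself. */
static inline uint32_t empty_region_ticks(volatile const uint32_t *counter)
{
    uint32_t start = *counter;     /* first read of the timer */
    __asm__ volatile ("nop");      /* NOP that would precede the code */
    __asm__ volatile ("nop");      /* NOP that would follow the code */
    uint32_t end = *counter;       /* second read of the timer */

    return end - start;            /* wraparound-safe unsigned difference */
}

/* Hypothetical helper: subtracts the baseline from a raw measurement,
 * clamping at zero so that a small raw value cannot underflow. */
static inline uint32_t net_ticks(uint32_t raw, uint32_t baseline)
{
    return (raw > baseline) ? (raw - baseline) : 0u;
}
```

Whether the baseline is stable from run to run depends on the device and on where the code executes, so it is worth repeating the empty measurement a few times before trusting it.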

Of course, you should run your code from internal SRAM so that it is not slowed down by the slower Flash memory.

I cannot guarantee that this will be 100% cycle-accurate on all devices, but it should get very close. (See Chris' comment below.) Note also that this method is intended for an environment with no interrupts.

answered on Stack Overflow Nov 23, 2014 by (unknown user) • edited Nov 27, 2018 by Venemo

User contributions licensed under CC BY-SA 3.0