Runtime delay loop calibration


Introduction to the underlaying problem

[Please scroll down to "The actual (stackoverflow) problem" if you just want to see what I have a problem with]

Consider the following delay loop code,

using clock_type = std::chrono::steady_clock;
std::atomic<uint64_t> m_word;
constexpr uint64_t tokens_mask = 0xffffffff;

double measure_delay_loop_in_ms(int delay_loop_size, int inner_loop_size)
  auto start = clock_type::now();
  // The actual delay loop under test.
  for (int i = delay_loop_size; i != 0; --i)
    if ((m_word.load(std::memory_order_relaxed) & tokens_mask) != 0)
    for (int j = inner_loop_size; j != 0; --j)
      asm volatile ("");
  // End of actual delay loop test code.
  auto stop = clock_type::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();

Where m_word is an std::atomic<uint64_t> that is equal to zero for the entire duration of the test: the break is never executed. tokens_mask is a constexpr uint64_t with the least significant 32 bits set. Not that this precise choice of code is relevant for this question; it is just a choice that I made.

The resulting assembly code then is, for example (here using g++ 8.3.0 and -O2):

        mov     rax, QWORD PTR m_word[rip]  // m_word & tokens_mask
        test    eax, eax                    // != 0
        jne     .L2                         // break;
        test    ebp, ebp                    // inner_loop_size
        je      .L3                         // != 0
        mov     eax, ebp                    // j = inner_loop_size
        sub     eax, 1                      // --j
        jne     .L4                         // j != 0
        sub     ebx, 1                      // --i
        jne     .L5                         // i != 0

The time that this loop takes depends on many things:

1) the values of delay_loop_size and inner_loop_size.
2) the hardware it is running on (e.g. the CPU frequency).
3) the compiler / optimization used.
4) the recent history of the CPU (e.g. cache and execution pipeline).
5) whether or not the delay loop is interrupted while it is running.

Because of points 2) and 3) I want to calibrate the delay loop at the start of the program, aka during run time.

What is required is a good estimate for the delay as function of delay_loop_size and inner_loop_size with a linear approximation.

That is, we need a function

double delay_estimate_in_ms(int delay_loop_size, int inner_loop_size)
  return offset + delay_loop_size * (alpha + beta * inner_loop_size);

where the double's offset, alpha and beta need to be determined at the start of the program doing measurements with measure_delay_loop_in_ms.

That this approximation is a good one follows from the fact that I'm not interested in outliers as a result of point 5). Then, assuming that variations as a result of point 4) will average out, aka can be considered normal 'noise', it is likely that the number of (average) clock cycles will be linear in inner_loop_size for fixed delay_loop_size. Likewise, the number of average clock cycles will be linear in delay_loop_size when we keep inner_loop_size fixed. This then leads to the above formula.

Assuming an accurate measurement we could find offset by setting delay_loop_size and inner_loop_size to zero, and subsequently finding alpha and beta with two more measurements:

offset = measure_delay_loop_in_ms(0, 0);
alpha = (measure_delay_loop_in_ms(10000, 0) - offset) / 10000;
beta = ((measure_delay_loop_in_ms(10000, 10000) - offset) / 10000 - alpha) / 10000;

This won't work for obvious reasons. At the very least we'd have to do many measurements of each to get rid of outliers and take some average. Then this simply isn't the right way to find a best linear fit. And finally, as it is unknown how long each delay loop takes, it might take a long time to determine beta this way (that is, those values of 10000 are arbitrary and might not be very practical when you want to a swift calibration at the start of every run of the program). Finally, it is not unthinkable that our estimate of linearity only holds as a good approximation around sane values, aka the values that we will use eventually.

Therefore the following must be taken into account:

  • The delay loop must tuned around approximately 1 ms (or whatever value you wish, but consider it to be a fixed value; we can investigate afterwards if offset, alpha and beta depend on it very much or not).
  • delay_loop_size must be between 10000 and 100000 if at all possible (if not then set inner_loop_size to zero and just find the best fit). If possible, inner_loop should be larger than 10, for values larger than 10 it is preferred to have a larger delay_loop_size.

I have reasons for this, but they fall outside the scope of this question (it has to do with bus contention and response time to wake up this spinner thread by changing m_word).

The above translates in the following strategy to find optimal values of delay_loop_size and inner_loop_size:

  1. Set inner_loop_size to 10 and find a good linear fit as function of delay_loop_size with a high accuracy around 1 ms.
  2. Check that this results in a value of delay_loop_size between 10000 and 100000. If so then beta can be determined trivially from (enough) measurements with that fixed value of delay_loop_size and values of inner_loop_size of 9, 10 and 11. So, assume we're done.
  3. delay_loop_size is under 10000, fix it at 10000 and find a good linear approximation for inner_loop_size (which will be now we less than 10, so you might even just do a quick binary search on those few values).
  4. If delay_loop_size is over 100000, then fix it at 100000 and find good linear approximation for the delay as function of inner_loop_size with a high accuracy around 1 ms. Then fix inner_loop_size at the corresponding value for 1 ms, but rounded up and find a good linear approximation for the delay as function of delay_loop_size around 1 ms.
  5. Calculate offset, alpha and beta from the previously found linear approximations.

A crucial part of this strategy is find a good linear approximation for x around y.

The actual (stackoverflow) problem

Hence, consider we are given a constant goal.

double const goal = 1.0;  // 1.0 is just an example.

And a function measure(int x).

double measure(int x);

Avoid calling measure for arbitrary values of x because a larger value of x leads to longer delays. However, we need to find a value of x such that measure(x) returns a value as close to goal as possible most of the time. Sometimes measure will return an arbitrary large value, but it may be assumed that this happens less than 1% of the time. If this happens, it may be assumed that those measurements can be recognized by the fact that they lay outside the 2 sigma range. Hence, we need to find the best linear fit for measure as function of x around goal, including the sigma of the (assumed) normal distributed values while ignoring those values that lay too far above this line.

Note that in fact it is NOT a normal distribution - and really anything that approximates a 'spread' is good enough - just as long as it detects the outliers well. I'd strongly prefer to fit a line simply through the 'center of mass' of measurements, that is: the further away from the average the less interesting they are (the less weight should be applied). Points below the line could be measured towards an error by taking their distance to the line (as opposed to taking the square as is usual), the same for points above the line, but then with some cut off where they are discarded completely.

In that case there is no normal distribution so also no sigma. One could simply write a program that returns the values of a and b such that y = a * x + b is a good approximation, and a cutoff value.

It could be disaster if even one outlier is used to estimate the linear approximation however; simply ignoring the top 10% of the measurements won't do (I'd like to be 99.999999% sure that approximation is correct to within a factor of 2, and using a single random outlier could cause that).

An additional problem is that I'm not feeling very certain about how linear the actual measurements will be (I'd think they have to be though). A definite problem however is when one would do a few measurements far away from the goal, for obvious reasons, but does that mean we should also disregard measurements that lay far away from the goal? Probably not - it might be more useful to have an approximation that works over a larger range then one that is more accurate around the goal... Nevertheless I'd really like to have measurements on both sides of the goal ;).

How to write this calibration code?


Here is an image of a RANSAC like approach where all colored points marked "errXYZ" have been removed one by one (starting with err0 .. err27) as "outliers".

Outer loop size vs delay (with ils=10)

But then there is also this kind of measurement:

Same as above, different run

Ok, I'm done with my algorithm... A case like the above is now treated like this:

Final algorithm

As you can see, this fits a line to the faster delay loop case (the orange diamonds) while ignoring the single SUPER fast(???) measurement of 0.4 ms with a loop size of 50000 (the black triangle). Most points have been ignored one by one because they single handedly shifted the OLS slope too much, but the (many) green X's have been cut off for a different reason; they were part of a fitted line together with the orange diamonds, but the orange diamond were laying more than 2% (in slope) below that fitted line! In that case I cut off all points that more than 2% above the line (and refit a new line).

asked on Stack Overflow Oct 4, 2019 by Carlo Wood • edited Oct 12, 2019 by Carlo Wood

0 Answers

Nobody has answered this question yet.

User contributions licensed under CC BY-SA 3.0