[Please scroll down to "The actual (stackoverflow) problem" if you just want to see what I have a problem with]

Consider the following delay loop code,

```cpp
using clock_type = std::chrono::steady_clock;

std::atomic<uint64_t> m_word;
constexpr uint64_t tokens_mask = 0xffffffff;

double measure_delay_loop_in_ms(int delay_loop_size, int inner_loop_size)
{
  auto start = clock_type::now();
  // The actual delay loop under test.
  for (int i = delay_loop_size; i != 0; --i)
  {
    if ((m_word.load(std::memory_order_relaxed) & tokens_mask) != 0)
      break;
    for (int j = inner_loop_size; j != 0; --j)
      asm volatile ("");
  }
  // End of actual delay loop test code.
  auto stop = clock_type::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

where `m_word` is an `std::atomic<uint64_t>` that is equal to zero for the entire duration of the test (so the `break` is never executed), and `tokens_mask` is a `constexpr uint64_t` with the least significant 32 bits set. Not that this precise choice of code is relevant for this question; it is just a choice that I made.

The resulting assembly code is then, for example (here using g++ 8.3.0 and `-O2`):

```asm
.L5:
        mov rax, QWORD PTR m_word[rip]  // m_word & tokens_mask
        test eax, eax                   // != 0
        jne .L2                         // break;
        test ebp, ebp                   // inner_loop_size
        je .L3                          // != 0
        mov eax, ebp                    // j = inner_loop_size
.L4:
        sub eax, 1                      // --j
        jne .L4                         // j != 0
.L3:
        sub ebx, 1                      // --i
        jne .L5                         // i != 0
.L2:
```

The time that this loop takes depends on many things:

1) the values of `delay_loop_size` and `inner_loop_size`.

2) the hardware it is running on (e.g. the CPU frequency).

3) the compiler / optimization used.

4) the recent history of the CPU (e.g. cache and execution pipeline).

5) whether or not the delay loop is interrupted while it is running.

Because of points 2) and 3) I want to calibrate the delay loop at the start of the program, i.e. at run time.

What is required is a good estimate for the delay as a function of `delay_loop_size` and `inner_loop_size`, using a linear approximation.

That is, we need a function

```cpp
double offset, alpha, beta;  // To be determined at program start.

double delay_estimate_in_ms(int delay_loop_size, int inner_loop_size)
{
  return offset + delay_loop_size * (alpha + beta * inner_loop_size);
}
```

where the doubles `offset`, `alpha` and `beta` need to be determined at the start of the program by doing measurements with `measure_delay_loop_in_ms`.

That this approximation is a good one follows from the fact that I am not interested in outliers resulting from point 5). Then, assuming that variations as a result of point 4) will average out, i.e. can be considered normal 'noise', it is likely that the (average) number of clock cycles will be linear in `inner_loop_size` for fixed `delay_loop_size`. Likewise, the average number of clock cycles will be linear in `delay_loop_size` when we keep `inner_loop_size` fixed. This then leads to the above formula.

Assuming an *accurate* measurement, we could find `offset` by setting `delay_loop_size` and `inner_loop_size` to zero, and subsequently find `alpha` and `beta` with two more measurements:

```cpp
offset = measure_delay_loop_in_ms(0, 0);
alpha  = (measure_delay_loop_in_ms(10000, 0) - offset) / 10000;
beta   = ((measure_delay_loop_in_ms(10000, 10000) - offset) / 10000 - alpha) / 10000;
```

This won't work for obvious reasons. At the very least we would have to do many measurements of each and take some average to get rid of outliers. Moreover, this simply isn't the right way to find a best linear fit. Furthermore, as it is unknown how long each delay loop takes, it might take a long time to determine `beta` this way (those values of 10000 are arbitrary and might not be very practical when you want a swift calibration at the start of every run of the program). Finally, it is not unthinkable that our assumption of linearity only holds as a good approximation around sane values, i.e. the values that we will use eventually.
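As a first mitigation for outliers, one could at least replace each single measurement by the median of several repeated measurements. A minimal sketch (the helper name `median_of` is mine, not part of the code above):

```cpp
#include <algorithm>
#include <vector>

// Take the median of `repeats` measurements, so that a rare outlier
// (point 5) cannot dominate a single calibration point.
// `measure` stands in for measure_delay_loop_in_ms with fixed arguments.
template<typename Measure>
double median_of(Measure measure, int repeats)
{
  std::vector<double> samples;
  samples.reserve(repeats);
  for (int i = 0; i < repeats; ++i)
    samples.push_back(measure());
  // Partially sort just enough to put the median in the middle slot.
  std::nth_element(samples.begin(), samples.begin() + repeats / 2, samples.end());
  return samples[repeats / 2];
}
```

The median is robust as long as fewer than half of the samples are outliers, which is far more than the assumed 1%.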

Therefore the following must be taken into account:

- The delay loop must be tuned to approximately 1 ms (or whatever value you wish, but consider it to be a fixed value; we can investigate afterwards whether `offset`, `alpha` and `beta` depend on it very much or not).
- `delay_loop_size` must be between 10000 and 100000 if at all possible (if not, set `inner_loop_size` to zero and just find the best fit). If possible, `inner_loop_size` should be larger than 10; for values larger than 10 it is preferred to have a larger `delay_loop_size`.

I have reasons for this, but they fall outside the scope of this question (it has to do with bus contention and the response time to wake up this spinner thread by changing `m_word`).

The above translates into the following strategy to find optimal values of `delay_loop_size` and `inner_loop_size`:

- Set `inner_loop_size` to 10 and find a good linear fit as a function of `delay_loop_size` with a high accuracy around 1 ms.
- Check that this results in a value of `delay_loop_size` between 10000 and 100000. If so, then `beta` can be determined trivially from (enough) measurements with that fixed value of `delay_loop_size` and values of `inner_loop_size` of 9, 10 and 11. So, assume we're done.
- If `delay_loop_size` is under 10000, fix it at 10000 and find a good linear approximation for `inner_loop_size` (which will now be less than 10, so you might even just do a quick binary search on those few values).
- If `delay_loop_size` is over 100000, then fix it at 100000 and find a good linear approximation for the delay as a function of `inner_loop_size` with a high accuracy around 1 ms. Then fix `inner_loop_size` at the corresponding value for 1 ms, but rounded up, and find a good linear approximation for the delay as a function of `delay_loop_size` around 1 ms.
- Calculate `offset`, `alpha` and `beta` from the previously found linear approximations.

A crucial part of this strategy is *find a good linear approximation for x around y*.

Hence, consider we are given a constant `goal`:

```cpp
double const goal = 1.0;  // 1.0 is just an example.
```

and a function `measure(int x)`:

```cpp
double measure(int x);
```

We should avoid calling `measure` for arbitrary values of `x`, because a larger value of `x` leads to a longer delay. However, we need to find a value of `x` such that `measure(x)` returns a value as close to `goal` as possible *most* of the time. Sometimes `measure` will return an arbitrarily large value, but it may be assumed that this happens less than 1% of the time, and that those measurements can be recognized by the fact that they lie outside the 2-sigma range. Hence, we need to find the best linear fit for `measure` as a function of `x` around `goal`, including the sigma of the (assumed) normally distributed values, while ignoring those values that lie too far above this line.

Note that in fact it is NOT a normal distribution, and really anything that approximates a 'spread' is good enough, just as long as it detects the outliers well. I'd strongly prefer to simply fit a line through the 'center of mass' of the measurements, that is: the further a point is from the average, the less weight it should get. Points below the line could be counted towards the error by taking their distance to the line (as opposed to taking the square, as is usual); the same for points above the line, but then with some cutoff beyond which they are discarded completely.

In that case there is no normal distribution and hence no sigma. One could simply write a program that returns the values of `a` and `b` such that `y = a * x + b` is a good approximation, plus a cutoff value.
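A minimal sketch of such a program, assuming outliers only ever make the measurement slower (so only points *above* the line are trimmed); `trimmed_fit` and the fixed `cutoff_factor` are my own simplifications, not a final design:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct Fit { double a, b; };  // y = a * x + b

// Plain ordinary-least-squares fit over (x, y) points; needs >= 2 points.
static Fit ols(std::vector<std::pair<double, double>> const& pts)
{
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  double n = pts.size();
  for (auto const& p : pts)
  {
    sx += p.first;            sy += p.second;
    sxx += p.first * p.first; sxy += p.first * p.second;
  }
  double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
  return { a, (sy - a * sx) / n };
}

// Refit while discarding points that lie more than `cutoff_factor`
// times the predicted value above the current line.
Fit trimmed_fit(std::vector<std::pair<double, double>> pts, double cutoff_factor)
{
  for (;;)
  {
    Fit f = ols(pts);
    auto new_end = std::remove_if(pts.begin(), pts.end(),
        [&](std::pair<double, double> const& p)
        { return p.second > cutoff_factor * (f.a * p.first + f.b); });
    if (new_end == pts.end())
      return f;               // Nothing discarded: converged.
    pts.erase(new_end, pts.end());
  }
}
```

This is still vulnerable to the problem described below: a single extreme outlier pulls the very first OLS fit, which decides what gets trimmed.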

It could be a disaster if even one outlier is used to estimate the linear approximation, however; simply ignoring the top 10% of the measurements won't do (I'd like to be 99.999999% sure that the approximation is correct to within a factor of 2, and a single random outlier could ruin that).

An additional problem is that I'm not feeling very certain about how linear the actual measurements will be (I'd think they have to be, though). A definite problem, however, arises when one does a few measurements far away from the goal, for obvious reasons; but does that mean we should also disregard measurements that lie far away from the goal? Probably not: it might be more useful to have an approximation that works over a larger range than one that is more accurate around the goal... Nevertheless, I'd really like to have measurements on both sides of the goal ;).

How to write this calibration code?

**EDIT**

Here is an image of a RANSAC-like approach where all colored points marked "errXYZ" have been removed one by one (starting with err0 .. err27) as "outliers".

But then there is also this kind of measurement:

Ok, I'm done with my algorithm... A case like the above is now treated like this:

As you can see, this fits a line to the faster delay-loop case (the orange diamonds) while ignoring the single SUPER fast(???) measurement of 0.4 ms with a loop size of 50000 (the black triangle). Most points have been ignored one by one because they single-handedly shifted the OLS slope too much, but the (many) green X's have been cut off for a different reason: they were part of a fitted line together with the orange diamonds, but the orange diamonds were lying more than 2% (in slope) below that fitted line! In that case I cut off all points that lie more than 2% *above* the line (and refit a new line).
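My "single-handedly shifted the OLS slope" test can be approximated by a leave-one-out influence check. This is only a sketch of that idea (the names and the threshold are mine, and it is not the exact algorithm I implemented):

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// OLS slope over (x, y) points; needs >= 2 points.
static double ols_slope(std::vector<std::pair<double, double>> const& pts)
{
  double sx = 0, sy = 0, sxx = 0, sxy = 0, n = pts.size();
  for (auto const& p : pts)
  {
    sx += p.first;            sy += p.second;
    sxx += p.first * p.first; sxy += p.first * p.second;
  }
  return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

// Repeatedly remove a single point whose absence changes the fitted
// slope by more than `rel_threshold` (relative to the current slope).
void remove_influential(std::vector<std::pair<double, double>>& pts,
                        double rel_threshold)
{
  bool removed = true;
  while (removed && pts.size() > 2)
  {
    removed = false;
    double slope = ols_slope(pts);
    for (std::size_t i = 0; i < pts.size(); ++i)
    {
      auto without = pts;                    // Refit without point i.
      without.erase(without.begin() + i);
      if (std::abs(ols_slope(without) - slope) > rel_threshold * std::abs(slope))
      {
        pts.erase(pts.begin() + i);          // Point i is too influential.
        removed = true;
        break;
      }
    }
  }
}
```

Copying the point vector for every leave-one-out fit is O(n^2) per removal; for calibration-sized data sets that is cheap enough.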
