Commit 3d9d64aa authored by Andriy Gapon's avatar Andriy Gapon
Browse files

kern_tc: unify timecounter to bintime delta conversion

There are two places where we convert from a timecounter delta to
a bintime delta: tc_windup and bintime_off.
Both functions use the same calculations when the timecounter delta is
small.  But for a large delta (greater than approximately an equivalent
of 1 second) the calculations were different.  Both functions use
approximate calculations based on th_scale that avoid division.  Both
produce values slightly greater than a true value, calculated with
division by tc_frequency, would be.  tc_windup is slightly more
accurate, so its result is closer to the true value and, thus, smaller
than bintime_off result.

As a consequence there can be a jump back in time when time hands are
switched after a long period of time (a large delta).  Just before the
switch the time would be calculated with a large delta from
th_offset_count in bintime_off.  tc_windup does the switch using its own
calculations of a new th_offset using the large delta.  As explained
earlier, the new th_offset may end up being less than the previously
produced binuptime.  So, for a period of time new binuptime values may
be "back in time" comparing to values just before the switch.

Such a jump must never happen.  All the code assumes that the uptime is
monotonically nondecreasing and some code works incorrectly when that
assumption is broken.  For example, we have observed sleepq_timeout()
ignoring a timeout when the sbinuptime value obtained by the callout
code was greater than the expiration value, but the sbinuptime obtained
in sleepq_timeout() was less than it.  In that case the target thread
would never get woken up.

The unified calculations should ensure the monotonic property of the
uptime.

The problem is quite rare as normally tc_windup should be called HZ
times per second (typically 1000 or 100).  But it may happen in VMs on
very busy hypervisors where a VM's virtual CPU may not get an execution
time slot for a second or more.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Panzura LLC
parent 1b0602f2
......@@ -213,6 +213,23 @@ tc_delta(struct timehands *th)
tc->tc_counter_mask);
}
static __inline void
bintime_add_tc_delta(struct bintime *bt, uint64_t scale,
uint64_t large_delta, uint64_t delta)
{
uint64_t x;
if (__predict_false(delta >= large_delta)) {
/* Avoid overflow for scale * delta. */
x = (scale >> 32) * delta;
bt->sec += x >> 32;
bintime_addx(bt, x << 32);
bintime_addx(bt, (scale & 0xffffffff) * delta);
} else {
bintime_addx(bt, scale * delta);
}
}
/*
* Functions for reading the time. We have to loop until we are sure that
* the timehands that we operated on was not updated under our feet. See
......@@ -224,7 +241,7 @@ bintime_off(struct bintime *bt, u_int off)
{
struct timehands *th;
struct bintime *btp;
uint64_t scale, x;
uint64_t scale;
u_int delta, gen, large_delta;
do {
......@@ -238,15 +255,7 @@ bintime_off(struct bintime *bt, u_int off)
atomic_thread_fence_acq();
} while (gen == 0 || gen != th->th_generation);
if (__predict_false(delta >= large_delta)) {
/* Avoid overflow for scale * delta. */
x = (scale >> 32) * delta;
bt->sec += x >> 32;
bintime_addx(bt, x << 32);
bintime_addx(bt, (scale & 0xffffffff) * delta);
} else {
bintime_addx(bt, scale * delta);
}
bintime_add_tc_delta(bt, scale, large_delta, delta);
}
#define GETTHBINTIME(dst, member) \
do { \
......@@ -1392,17 +1401,8 @@ tc_windup(struct bintime *new_boottimebin)
#endif
th->th_offset_count += delta;
th->th_offset_count &= th->th_counter->tc_counter_mask;
while (delta > th->th_counter->tc_frequency) {
/* Eat complete unadjusted seconds. */
delta -= th->th_counter->tc_frequency;
th->th_offset.sec++;
}
if ((delta > th->th_counter->tc_frequency / 2) &&
(th->th_scale * delta < ((uint64_t)1 << 63))) {
/* The product th_scale * delta just barely overflows. */
th->th_offset.sec++;
}
bintime_addx(&th->th_offset, th->th_scale * delta);
bintime_add_tc_delta(&th->th_offset, th->th_scale,
th->th_large_delta, delta);
/*
* Hardware latching timecounters may not generate interrupts on
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment