    kern_tc: unify timecounter to bintime delta conversion · 3d9d64aa
    Andriy Gapon authored
    There are two places where a timecounter delta is converted to
    a bintime delta: tc_windup and bintime_off.
    Both functions use the same calculation when the timecounter delta is
    small.  For a large delta (greater than the equivalent of roughly
    1 second), however, the calculations were different.  Both functions
    use approximate calculations based on th_scale that avoid division, and
    both produce values slightly greater than the true value that division
    by tc_frequency would give.  tc_windup is slightly more accurate, so
    its result is closer to the true value and, thus, smaller than the
    bintime_off result.
    
    As a consequence, time can jump backwards when timehands are
    switched after a long period of time (a large delta).  Just before the
    switch, the time is calculated in bintime_off with a large delta from
    th_offset_count.  tc_windup then performs the switch, calculating the
    new th_offset from the same large delta with its own method.  As
    explained above, the new th_offset may end up being less than the
    previously produced binuptime.  So, for a period of time, new binuptime
    values may be "back in time" compared to values obtained just before
    the switch.
    
    Such a jump must never happen.  All the code assumes that the uptime is
    monotonically nondecreasing and some code works incorrectly when that
    assumption is broken.  For example, we have observed sleepq_timeout()
    ignoring a timeout when the sbinuptime value obtained by the callout
    code was greater than the expiration value, but the sbinuptime obtained
    in sleepq_timeout() was less than it.  In that case the target thread
    would never get woken up.
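
    The lost wakeup described above can be modelled in a few lines of
    user-space C.  This is only a simplified sketch, not the kernel code:
    sbt_t, callout_fires(), handler_wakes() and wakeup_lost() are
    hypothetical names, and the real callout and sleepqueue code does much
    more than compare two numbers.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t sbt_t;	/* hypothetical stand-in for sbintime_t */

/* Callout layer (simplified): fire once the clock reaches the deadline. */
static bool
callout_fires(sbt_t now, sbt_t expire)
{
	return (now >= expire);
}

/* Handler (simplified): re-read the clock, ignore a "premature" timeout. */
static bool
handler_wakes(sbt_t now, sbt_t expire)
{
	return (now >= expire);
}

/* True when the timer fired but the handler ignored it: the wakeup is lost. */
static bool
wakeup_lost(sbt_t t_callout, sbt_t t_handler, sbt_t expire)
{
	return (callout_fires(t_callout, expire) &&
	    !handler_wakes(t_handler, expire));
}
```

    With a monotonic clock the handler's reading is never below the
    callout's reading, so wakeup_lost() cannot return true.  It returns true
    only when the second reading falls back below the deadline, for example
    wakeup_lost(1005, 998, 1000).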
    
    The unified calculations should ensure the monotonic property of the
    uptime.
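
    One way to picture a unified conversion is a single helper that both
    the reader path and the windup path call, so that the same delta always
    yields the same bintime.  The following is a user-space sketch under
    stated assumptions, not the committed code: struct bt, bt_addx() and
    bt_add_delta() are hypothetical names, and scale stands for th_scale
    (the length of one tick in units of 2^-64 s).  The multiplication is
    split into two 32-bit halves so that no intermediate product overflows
    and no division is needed.

```c
#include <stdint.h>

/* User-space stand-in for the kernel's struct bintime (assumption). */
struct bt {
	int64_t  sec;	/* whole seconds */
	uint64_t frac;	/* fraction of a second, in units of 2^-64 s */
};

/* Add a 64-bit fraction to bt, carrying into the seconds field. */
static void
bt_addx(struct bt *bt, uint64_t x)
{
	uint64_t old = bt->frac;

	bt->frac += x;
	if (bt->frac < old)
		bt->sec++;
}

/*
 * Hypothetical unified helper: add delta ticks, each worth scale * 2^-64 s.
 * scale * delta can need up to 96 bits, so the product is formed from the
 * high and low 32-bit halves of scale separately.
 */
static void
bt_add_delta(struct bt *bt, uint64_t scale, uint32_t delta)
{
	/* High half of scale times delta, in units of 2^-32 s. */
	uint64_t x = (scale >> 32) * delta;

	bt->sec += (int64_t)(x >> 32);		/* whole seconds */
	bt_addx(bt, x << 32);			/* remaining fraction */
	bt_addx(bt, (scale & 0xffffffff) * delta);	/* low half of scale */
}
```

    If the reader computes th_offset plus the delta with this helper, and
    the windup advances th_offset by the same delta with the same helper,
    the new th_offset equals the last value a reader could have seen, so a
    later reading cannot be smaller.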
    
    The problem is quite rare as normally tc_windup should be called HZ
    times per second (typically 1000 or 100).  But it may happen in VMs on
    very busy hypervisors where a VM's virtual CPU may not get an execution
    time slot for a second or more.
    
    Reviewed by:	kib
    MFC after:	2 weeks
    Sponsored by:	Panzura LLC