[ntp:questions] ntpd wedged again

Dave Hart hart at ntp.org
Fri Feb 10 08:39:08 UTC 2012

On Fri, Feb 10, 2012 at 03:07, A C <agcarver+ntp at acarver.net> wrote:
> So far so good, it's running with reasonable offsets and jitter (PPS not yet
> enabled).
> But this is new to me in the logs:
> 10 Feb 01:41:37 ntpd[242]: ts_min 1328838097.185664676 ts_prev
> 1328838097.185550676 ts 1328838097.189745353
> 10 Feb 01:41:37 ntpd[242]: sys_fuzz 114000 nsec, this fuzz 0.000099410
> 10 Feb 01:41:37 ntpd[242]: get_systime prev result 0xd2def451.30cccb9b is
> 0.000780166 later than 0xd2def451.3099aa99

This is from recently-introduced code in ntp-dev that is reporting a
hoped-to-be-impossible situation of reading the clock using ntpd
routine get_systime() where the fuzzed result is earlier than the
previous get_systime()'s result.  These messages are not really
intended to be scrutable by those not familiar with the internals of
that routine, and my hope was they would not be seen at all.

Briefly, ntpd "fuzzes" each timestamp's bits below sys_fuzz.  The
display is inconveniently mixed between decimal and hexadecimal
representations, as those were the forms at hand and I didn't want to
try to convert them and risk bugs in such conversion confusing
results.  Until recently, the fuzzing was below the reported
precision, or minimum increment observed between subsequent clock
readings.  That should be no problem on typically high-precision unix
system clocks, but led to predictable "time went backwards" issues
with low-precision clocks such as Windows ntpd using the native system
clock ticking every 0.5 or 1.0 msec.

The new code measures not only the minimum increment between
subsequent readings, but also the minimum time to read the clock,
which is not the same with a low-precision system clock.  If ntpd
reads the system clock at the same value 100 times in a row, then it
increments by 1 msec, the code infers the time to read the clock is 1
msec /  100 or 10 usec, and sets sys_fuzz to 10 usec.  Later when
reading the system clock and converting to NTP 32:32 fixed-point
representation, a random value up to 10 usec will be added to the
value from the OS, where previously up to 1000 usec / 1 msec would be

At the same time fuzzing approach was changed, paranoid checks to be
sure both the OS clock and the converted and fuzzed 32:32 is always
advancing as well as the fuzzed conversion always increasing.  So
either those paranoid checks are wrong, or the code to measure the
fuzz at startup is wrong, leading to the logically impossible result
of the clock appearing to move backward relative to the prior reading.

In this specific case, the minimum time to read the clock was measured
at ntpd startup to be 114 usec, so each raw OS clock reading is
expected to be no less than 114 usec later than the prior.  The first
line mentioning three ts_ variable names is displaying the raw clock
readings from the OS, which are ordered as expected and not less than
114 usec apart, so far so good.  The second line shows the measured
minimum time to read the clock which sets the fuzz threshold, and the
randomly selected fuzz value for this reading, which is a random
number between 0 and sys_fuzz (114 usec).  The last line shows the
resulting fuzzed clock reading is 780 usec earlier than the prior
fuzzed reading, which should not be possible.

This code is green and could well be wrong.  I encourage others seeing
such messages in syslog or a configured ntpd log file to share a few
examples here or in a bug report.

Those who followed the introduction of a high-precision clock into
FreeBSD many years ago might recall similar OS-level warnings
reporting that time appeared to run backwards, seen for example when
using processor cycle counters on SMP systems where the counters are
not synchronized across processors.  I hope these messages will be
tamed by a bug fix and the new paranoid checks can continue to be
used, with more comprehensible warning text when the checks fail.

Dave Hart

Dave Hart

More information about the questions mailing list