[ntp:questions] Re: tinker step 0 (always slew) and kernel time discipline

David Woolley david at djwhome.demon.co.uk
Fri Sep 22 21:00:27 UTC 2006


In article <ef19v5$kqn$1 at zcars129.ca.nortel.com>,
Joe Harvell <harvell at nortel.com> wrote:

> This actually happened in a testbed for our application. NTP stats show
> that over the course of 22 days, the offsets of two configured NTP
> servers (both ours) serving one of our NTP clients started diverging
> up to a maximum distance of 800 seconds.  During this time, our NTP

This could only happen if either the implementation was broken, or
they were mis-using the local clock pseudo reference clock.  If the
servers were using a proper reference clock as their primary source,
root dispersion would have exceeded its maximum value when the
error was probably a lot less than a second and the servers would have
been rejected completely.
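
As a rough illustration of why the servers should have been rejected long
before any large error built up, here is a back-of-envelope sketch.  The
constants (PHI, MAXDIST, MAXDISP) are the values given in the NTPv4
specification, and the 1 ppm "real" drift of the server's oscillator is
purely an assumption chosen for illustration; it ignores root delay and
every other contribution to the distance.

```python
# Back-of-envelope sketch, not ntpd code.
PHI = 15e-6            # s/s, rate at which dispersion is assumed to grow
MAXDIST = 1.0          # s, root distance beyond which a server is unfit
MAXDISP = 16.0         # s, maximum dispersion
ASSUMED_DRIFT = 1e-6   # s/s, hypothetical real drift of a free-running server


def seconds_until(limit, rate=PHI):
    """Time for a quantity growing linearly at `rate` to reach `limit`."""
    return limit / rate


t_dist = seconds_until(MAXDIST)           # time to reach the distance threshold
t_disp = seconds_until(MAXDISP)           # time to reach maximum dispersion
error_at_dist = ASSUMED_DRIFT * t_dist    # real error accumulated by then

if __name__ == "__main__":
    print(f"distance threshold after {t_dist / 3600:.1f} h")
    print(f"maximum dispersion after {t_disp / 86400:.1f} days")
    print(f"real error at threshold: {error_at_dist * 1000:.0f} ms")
```

Under these assumptions the threshold is reached after roughly 18.5 hours,
when the real error is still only a few tens of milliseconds.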

Configuring a local clock breaks this process, so it should never be done
by default (even though distributors like doing this).  In many cases,
it is best not to have a local reference clock configured at all.  If you
do have more than one configured, you should arrange for each server to have
a different stratum, with steps of two between them, so that there is
a well defined priority amongst the different machines.
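
A minimal sketch of what that might look like, assuming two servers that
each keep a local clock as fallback.  The strata 8 and 10 are arbitrary
illustrative values, not something taken from the original poster's setup;
127.127.1.0 is the address of the local clock driver.

```
# ntp.conf on the first server (illustrative values)
server 127.127.1.0
fudge  127.127.1.0 stratum 8

# ntp.conf on the second server: two strata lower in priority
server 127.127.1.0
fudge  127.127.1.0 stratum 10
```

With a gap of two, the second server still sees the first as a better
source even after the first server's stratum has been incremented by one.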

If you don't have any real reference clocks in the overall network, it
is even more important that there is normally only one possible choice
of local clock reference.  Having two local clock references that are
diverging violates the fundamental principle that all NTP times are 
traceable to a single (and preferably UTC) time.

> client stepped its clock forward 940 times and backwards 803 times,
> with increasing magnitudes up to ~400 seconds.  The problem went away
> when someone "added an IP address to the configuration of one of the
> NTP servers."  (I am still trying to determine exactly what happened).

That sounds like that server had a local reference clock as 
fallback.

> The ntp.conf files of the NTP client, the stats, and a nice graph of
> the offsets is found at http://dingo.dogpad.net/ntpProblem/.

> I concede that only having 2 NTP servers for our host made this problem
> more likely to occur.  But considering the mayhem caused by jerking the
> clock back and forth every 15 minutes for 22 days, I think it is worth
> investigating whether to eliminate stepping altogether.

15 minutes sounds like the verification period that has to elapse before
ntpd becomes convinced that its time really is seriously wrong.

> I still don't understand why the clock was being stepped back and forth.
> One of the NTP servers showed up with 80f4 (unreachable) status every
> 15 minutes for the entire 22 days, but with 90f4 (reject) and 96f4
> (sys.peer) in between.  Oddly, this server was one of two servers,
> but the *other* server was the preferred peer.  I wonder why this peer

Normally, I believe, if you have just two servers and they have
non-intersecting error bounds, they will both be rejected and the
system will free run.  However, I think that prefer confuses the issue
by not allowing the preferred one to be discarded.  I have a feeling
this is actually implemented by stopping the discard process at the
point where it would discard the preferred server.  I suppose that the
other one could still be in contention at that point.

> would ever be selected as the sys.peer since the prefer peer was only
> reported unreachable 10 times over this 22 day period.  Would this be
> because the selection algorithm finds no intersection?

> Maybe the behavior I saw was a bug, and not the expected consequence of
> a failure scenario in which 2 NTP servers have diverging clocks.

The expected failure scenario is that one server is giving a false time
and the other is giving UTC time.  Any remaining servers will also give
UTC time, so the bad one will get voted out.
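
To make the two-server and the voting cases concrete, here is a much
simplified sketch of majority intersection over error bounds.  It is not
ntpd's actual selection code and it ignores prefer entirely; the function
name, the server names and the numbers are all made up for illustration.
Each source is treated as the interval offset +/- distance, and a group is
accepted only if more than half of the sources share a common point.

```python
def majority_intersection(sources):
    """Return the names of the largest group of sources whose error
    intervals overlap, or an empty set if that group is not a strict
    majority.  `sources` is a list of (name, offset_s, distance_s)."""
    edges = []
    for name, offset, dist in sources:
        edges.append((offset - dist, 0, name))  # 0: interval opens
        edges.append((offset + dist, 1, name))  # 1: interval closes
    edges.sort()  # at equal points, opens sort before closes

    active, best = set(), set()
    for _, kind, name in edges:
        if kind == 0:
            active.add(name)
            if len(active) > len(best):
                best = set(active)
        else:
            active.discard(name)

    # Require a strict majority of all configured sources.
    return best if len(best) > len(sources) / 2 else set()


# Two diverging servers: neither can outvote the other.
two = majority_intersection([("a", 0.0, 0.1), ("b", 800.0, 0.1)])

# Three servers, one of them wrong: the bad one is voted out.
three = majority_intersection(
    [("a", 0.0, 0.05), ("b", 0.01, 0.05), ("c", 800.0, 0.05)]
)

if __name__ == "__main__":
    print(two)
    print(three)
```

In this sketch the two-server case yields an empty set (nothing to
synchronise to), while the three-server case keeps "a" and "b" and drops
"c".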

I don't think prefer is intended to deal with broken clocks, only with
more accurate ones.



