[ntp:questions] Odd (mis)behavior when reference clock fails
oberman at es.net
Tue Sep 23 15:26:47 UTC 2008
> From: Martin Burnicki <martin.burnicki at meinberg.de>
> Date: Tue, 23 Sep 2008 09:34:06 +0200
> Sender: questions-bounces+oberman=es.net at lists.ntp.org
> Unruh wrote:
> > hundoj at comcast.net (Rob Neal) writes:
> >>On Tue, 16 Sep 2008, Kevin Oberman wrote:
> >>> We have a fairly large "mesh" of NTP servers spread across the
> >>> US. Almost all have PPS reference clocks and are quite
> >>> accurate. Recently one of the reference clocks located across the country
> >>> seems to have failed. Such is life.
> >>> The problem is that the system's time started drifting and eventually
> >>> became far enough out of sync with the mesh to be marked as a bad
> >>> ticker.
> >>> The only way I could get the clock to slew or step the time was to edit
> >>> the configuration and comment out the reference clock and PPS. It looks
> >>> like the system will only use the time from a reference clock when and
> >>> if the clock is configured, even if it can't be read.
> >>> Is there any way to "fix" this?
> >> What is it that you consider broken? Please clarify.
> >> I've re-read this several times, and don't see the problem.
> >> A reference clock broke. It was disregarded because it chimed
> >> badly.
> >> You expected something different?
> > A hardware clock broke. The computer which was using that hardware clock
> > insisted on using that hardware clock even though it gave no time. It
> > acted as a server, and eventually its time drifted so badly everyone else
> > saw it as a bad chimer.
> > It seems to have had other server lines in the /etc/ntp.conf, but ignored
> > them in favour of a non-working refclock.
> > That is how I interpret what he said, but I may be wrong as well.
> This is also how I understand this.
> Maybe the problem occurred because either the refclock did not report its
> failure state correctly, or ntpd's refclock driver did not pass the fail
> state on to the NTP kernel, so the refclock was not discarded after it
> failed.
> It would be helpful to know the exact NTP version, and which hardware clock
> and refclock driver was used.
It's 4.2.4p4 running on FreeBSD 7.0. The reference clock is an EndRun
Tech CDMA clock using the TrueTime driver. When the system was running,
ntpq claimed no successful polls of the reference clock or the PPS. It
was getting good responses from other systems, but not syncing to
them. The offset started small after the clock failed, about .003, and
steadily grew to over 5 ms. The reference clock always showed a zero
reachability, delay and offset and .001 jitter.
Here is my configuration:
server 127.127.5.1 prefer minpoll 4 maxpoll 4
fudge 127.127.5.1 refid CDMA
fudge 127.127.5.1 time1 .011
server 127.127.22.1 minpoll 4 maxpoll 4
fudge 127.127.22.1 flag3 1
peer time1-owamp.es.net iburst key 2
peer time2-owamp.es.net iburst key 2
peer time3-owamp.es.net iburst key 2
peer time4-owamp.es.net iburst key 2
peer time5-owamp.es.net iburst key 2
peer time6-owamp.es.net iburst key 2
peer time7-owamp.es.net iburst key 2
peer time8-owamp.es.net iburst key 2
peer time9-owamp.es.net iburst key 2
peer time10-owamp.es.net iburst key 2
peer time11-owamp.es.net iburst key 2
peer time12-owamp.es.net iburst key 2
All peers are identical systems with CDMA clocks. All are firewalled so
that they are not publicly visible.
Here is the ntpq -p output after restoring the reference clock to the
config and letting it run for a few minutes. Drift is already
# ntpq -p
     remote           refid      st t when poll reach    delay   offset  jitter
 TRUETIME(1)     .CDMA.           0 l    -   16     0    0.000    0.000   0.001
 PPS(1)          .PPS.            0 l    -   16     0    0.000    0.000   0.001
-time1-owamp.es. .PPS.            1 u   17   64   177    2.058  -10.335   0.038
*time2-owamp.es. .PPS.            1 u   49   64   177   24.556  -10.408   0.020
-time3-owamp.es. .PPS.            1 u   63   64   176   55.640  -10.337   0.049
+time4-owamp.es. .PPS.            1 u   59   64   176   20.770  -10.405   0.058
+time5-owamp.es. .PPS.            1 u   45   64   177   23.907  -10.406   0.014
-time6-owamp.es. .PPS.            1 u   46   64   177   14.790  -10.340   0.062
-time7-owamp.es. .PPS.            1 u   50   64    73   25.160  -10.381   0.022
-time8-owamp.es. .PPS.            1 u   27   64   177   27.378  -10.388   0.054
-time9-owamp.es. .PPS.            1 u   43   64   177   75.571  -10.118   0.067
+time10-owamp.es .PPS.            1 u   47   64   177   24.068  -10.401   0.048
-time11-owamp.es .PPS.            1 u   35   64   177   74.542  -10.314   0.035
-time12-owamp.es .PPS.            1 u   49   64   176    7.224  -10.361   0.036
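For anyone reading the billboard above: the reach column is an 8-bit shift register printed in octal, one bit per poll, with the most recent poll in the lowest bit. A reach of 0 on both refclock lines means none of the last eight polls succeeded, while 177 on a peer means seven successes so far. Here is a rough Python sketch of that decoding. The function names are made up for illustration, it assumes each peer line splits into the usual ten whitespace-separated fields, and it only recognizes the `*`, `+`, `-` and `#` tally characters.

```python
# Sketch only: decode the reach register and tally code of one "ntpq -p" line.
# Function names are hypothetical; they are not part of ntpq or ntpd.

TALLY = {
    "*": "system peer",
    "+": "candidate",
    "-": "outlier",
    "#": "backup",
    " ": "not selected",
}


def decode_reach(reach_octal):
    """Return (successful polls among the last 8, latest poll succeeded?)."""
    reg = int(reach_octal, 8) & 0xFF      # reach is printed in octal
    return bin(reg).count("1"), bool(reg & 1)


def parse_peer_line(line):
    """Split one peer line into tally code, remote, reach stats, and offset."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError("expected 10 fields, got %d" % len(fields))
    remote = fields[0]
    tally = " "
    # Assumption: a leading *, +, - or # is a tally code, not part of the name.
    if remote[0] in "*+-#":
        tally, remote = remote[0], remote[1:]
    polls_ok, last_ok = decode_reach(fields[6])
    return {
        "remote": remote,
        "status": TALLY[tally],
        "polls_ok": polls_ok,
        "last_poll_ok": last_ok,
        "offset_ms": float(fields[8]),
    }


if __name__ == "__main__":
    for sample in (
        "TRUETIME(1) .CDMA. 0 l - 16 0 0.000 0.000 0.001",
        "*time2-owamp.es. .PPS. 1 u 49 64 177 24.556 -10.408 0.020",
    ):
        print(parse_peer_line(sample))
```

Run against the two sample lines, this reports zero good polls for TRUETIME(1) and seven for time2-owamp, which matches the symptom described: the refclock is unreachable while the peers answer normally.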
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman at es.net Phone: +1 510 486-8634
Key fingerprint:059B 2DDF 031C 9BA3 14A4 EADA 927D EBB3 987B 3751