[ntp:questions] 2 NTP Servers with diverging clocks and how to avoid stepping backwards in time (repost)
jharvell at dogpad.net
Tue Sep 19 18:49:48 UTC 2006
I am doing post-mortem analysis on an NTP-related problem in which one
host running ntp-4.1.2 got into a state where it seemed to be making large
step corrections to its local clock.
When I look at the NTP stats file, I can see that something was terribly
wrong with one or more of the NTP servers this host was using. Sometime
around 18 August, the clocks of NTP servers 192.168.0.1 and 192.168.0.2
began to gradually diverge reaching a difference of over 800 seconds by
8 September. Compounding this problem, the peerstats file also shows one of
the NTP servers periodically (period of ~900s) being detected as
unreachable over the whole duration. The other NTP server had a few
sporadic instances of being unreachable.
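For anyone who wants to quantify the divergence from the peerstats file
directly, here is a minimal sketch. It assumes the usual peerstats layout,
where the first five whitespace-separated fields are the MJD day, seconds
past midnight, peer address, status word, and offset in seconds. The
function names are made up for illustration, and averaging per day is just
one convenient way to line up samples from the two peers.

```python
from collections import defaultdict


def parse_peerstats(lines):
    """Yield (mjd, seconds, address, offset) tuples from peerstats lines.

    Assumed field order: MJD day, seconds past midnight, peer address,
    status word, offset in seconds. Any trailing fields are ignored.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        mjd, secs, addr, _status, offset = fields[:5]
        yield int(mjd), float(secs), addr, float(offset)


def daily_mean_offsets(lines):
    """Return {(mjd, address): mean offset in seconds}."""
    sums = defaultdict(lambda: [0.0, 0])
    for mjd, _secs, addr, offset in parse_peerstats(lines):
        acc = sums[(mjd, addr)]
        acc[0] += offset
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def daily_divergence(lines, peer_a, peer_b):
    """Return {mjd: mean offset of peer_a minus mean offset of peer_b}.

    Days on which either peer has no samples are left out.
    """
    means = daily_mean_offsets(lines)
    days = sorted({mjd for mjd, _addr in means})
    return {
        day: means[(day, peer_a)] - means[(day, peer_b)]
        for day in days
        if (day, peer_a) in means and (day, peer_b) in means
    }
```

Calling daily_divergence(open("peerstats"), "192.168.0.1", "192.168.0.2")
should give one number per day, which makes it easy to see when the two
servers started to drift apart and how fast.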
I have captured all of the ntp configuration and the stats files. Also,
I prepared a graph
(http://dingo.dogpad.net/ntpProblem/reachableScatter.png) showing the
offset of each peer as a function of time. All the stats and config
(and the graph) can be found at http://dingo.dogpad.net/ntpProblem.
I am a little bit interested in understanding what could have happened
with the NTP servers on 18 August. I know that on 8 September, someone
changed the configuration of one of the NTP servers (Note: the servers
are probably not ntp.org's implementation), which apparently fixed the
problem.
I am more interested, however, in how my node handled this problem.
Before I started digging into the problem, I was under the impression
that ntp.org's ntpd never stepped the clock, but only slewed it to
correct it. Now I see this is not the default behavior, but I can
achieve this using tinker step 0. However, I read a thread on this
newsgroup from Feb 2005 in which David Mills suggested this could
produce large offsets and other unpredictable errors.
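To put a rough number on the "large offsets" concern: ntpd limits its slew
rate to 500 ppm, so with stepping disabled a big offset takes a long time to
work off. The sketch below is only back-of-the-envelope arithmetic, not a
model of the actual clock discipline loop. The function name is made up,
and it assumes the clock is slewed at the maximum rate the whole time.

```python
MAX_SLEW_PPM = 500.0  # ntpd's documented maximum slew rate


def slew_duration_seconds(offset_seconds, max_slew_ppm=MAX_SLEW_PPM):
    """Seconds needed to remove an offset by slewing at a constant rate.

    Assumes the clock is slewed at max_slew_ppm for the entire correction,
    which is the best case; the real discipline loop may be slower.
    """
    return abs(offset_seconds) / (max_slew_ppm * 1e-6)


if __name__ == "__main__":
    seconds = slew_duration_seconds(800.0)
    print(f"{seconds:.0f} s, about {seconds / 86400:.1f} days")
```

For the roughly 800 second difference seen here, that works out to about
1.6 million seconds, or something over 18 days, during which the local
clock would remain noticeably wrong.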
How can I avoid the large clock stepping in this scenario? Is it
related to the "prefer" keyword used for 192.168.0.1?
Can I safely use "tinker step 0" along with "kernel disable" to prevent
step corrections altogether?
Can anyone tell me what they think happened to cause the two NTP servers
to diverge so quickly?