[ntp:questions] Re: 2 NTP Servers with diverging clocks and how to avoid stepping backwards in time (repost)

Richard B. Gilbert rgilbert88 at comcast.net
Tue Sep 19 19:27:34 UTC 2006


Joseph Harvell wrote:
> I am doing post-mortem analysis on an NTP related problem in which one
> host running ntp-4.1.2 gets in a state where it seems to be making large
> step corrections to its local clock.
> 
> When I look at the NTP stats file, I can see that something was terribly
> wrong with one or more of the NTP servers this host was using.  Sometime
> around 18 August, the clocks of NTP servers 192.168.0.1 and 192.168.0.2
> began to gradually diverge reaching a difference of over 800 seconds by
> 8 September.  Compounding this problem, the peerstats also shows one of
> the NTP servers periodically (period of ~900s) being detected as
> unreachable over the whole duration.  The other NTP server had a few
> sporadic incidences of being unreachable.
> 
> I have captured all of the ntp configuration and the stats files.  Also,
> I prepared a graph
> (http://dingo.dogpad.net/ntpProblem/reachableScatter.png) showing the
> offset of each peer as a function of time.  All the stats and config
> (and the graph) can be found at http://dingo.dogpad.net/ntpProblem.
> 
> I am a little bit interested in understanding what could have happened
> with the NTP servers on 18 August.  I know that on 8 September, someone
> changed the configuration of one of the NTP servers (Note: the servers
> are probably not ntp.org's implementation), which apparently fixed the
> problem.
> 
> I am more interested, however, in how my node handled this problem.
> Before I started digging into the problem, I was under the impression
> that ntp.org's ntpd never stepped the clock, but only slewed it to
> correct it.  Now I see this is not the default behavior, but I can
> achieve this using tinker step 0.  However, I read a thread on this
> newsgroup from Feb 2005 in which David Mills suggested this could
> produce large offsets and other unpredictable errors.

Ntpd will step the clock if the error exceeds 128 ms (the default step 
threshold) but is less than 1000 seconds (the default panic threshold).  
If the error is greater than 1000 seconds it declares the situation 
hopeless and exits.
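
As a rough illustration of that decision, here is a minimal Python sketch.
It is not ntpd's code: the function name classify_offset is made up for
this example, the defaults assume ntpd's documented 128 ms step threshold
and 1000 s panic threshold, and it ignores the stepout interval that ntpd
waits before actually stepping.  Treating a step threshold of 0 as "never
step" mirrors the "tinker step 0" setting mentioned in the question.

```python
# Simplified sketch of how an offset maps to ntpd's reaction.
# Assumed defaults: 128 ms step threshold, 1000 s panic threshold.
STEP_THRESHOLD_S = 0.128
PANIC_THRESHOLD_S = 1000.0


def classify_offset(offset_s, step=STEP_THRESHOLD_S, panic=PANIC_THRESHOLD_S):
    """Return 'slew', 'step' or 'panic' for a clock offset in seconds.

    step=0 models 'tinker step 0': stepping is disabled, so every offset
    below the panic threshold is slewed.
    """
    magnitude = abs(offset_s)
    if magnitude > panic:
        return "panic"   # ntpd logs a message and exits
    if step > 0 and magnitude > step:
        return "step"    # clock is set abruptly, possibly backwards
    return "slew"        # clock rate is adjusted gradually


if __name__ == "__main__":
    for offset in (0.05, -5.0, 800.0, 2000.0):
        print(offset, classify_offset(offset))
```

An 800 second offset, as in the stats described above, falls in the "step"
range with the default thresholds.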

> 
> How can I avoid the large clock stepping in this scenario?  Is it
> related to the "prefer" keyword used for 192.168.0.1?
> Can I safely use "tinker step 0" along with "kernel disable" to prevent
> step corrections altogether?

Safely??  Probably not!!!!  Far better to fix the problem, whatever it 
might be.

If you configure four servers and one fails somehow (wrong time, crash, 
etc.)  ntpd will happily continue with the remaining three servers.  If 
you configure five servers, two can fail without ill effect.  Two 
servers is the worst possible configuration; when the two differ, as 
they inevitably will, ntpd has no means of determining which one is more 
nearly correct!  Three servers degenerates too easily to the two server 
case!
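
To see why two servers cannot outvote each other, here is a small Python
sketch of a majority-by-interval-overlap check.  It is a simplified
Marzullo-style intersection written for illustration only, not ntpd's
actual selection algorithm; the function name select_truechimers, the
error bounds, and the servers other than 192.168.0.1 and 192.168.0.2 are
assumptions made up for the example.

```python
def select_truechimers(intervals):
    """Pick the servers whose offset intervals share a common point.

    intervals maps a server name to (low, high), i.e. offset minus and
    plus its error bound, in seconds.  Returns the names in the largest
    overlapping group, or an empty set if that group is not a strict
    majority of the configured servers.
    """
    edges = []
    for low, high in intervals.values():
        edges.append((low, 0))    # 0 = interval opens (sorts before a close)
        edges.append((high, 1))   # 1 = interval closes
    edges.sort()

    best = count = 0
    best_point = None
    for point, kind in edges:
        if kind == 0:
            count += 1
            if count > best:
                best, best_point = count, point
        else:
            count -= 1

    if best * 2 <= len(intervals):
        return set()              # no majority: cannot tell who is right
    return {name for name, (low, high) in intervals.items()
            if low <= best_point <= high}


if __name__ == "__main__":
    two = {"192.168.0.1": (-0.01, 0.01), "192.168.0.2": (799.99, 800.01)}
    four = dict(two, serverC=(-0.02, 0.02), serverD=(-0.015, 0.005))
    print(select_truechimers(two))    # two servers that disagree
    print(select_truechimers(four))   # one bad server out of four
```

With only the two diverging servers there is no majority, so nothing can be
selected with confidence; adding two more agreeing servers isolates the one
that is 800 seconds off.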

One usual cause of persistent stepping on Linux systems is the local 
clock being updated 1000 times per second instead of 100 (kernel 
parameter HZ needs to be set to 100).  The other usual cause of 
persistent stepping is a local clock frequency error greater than 500 
parts per million.  The only cure for this is to repair or replace the 
local clock (which usually means replacing the motherboard).
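
For a sense of scale, the arithmetic behind a parts-per-million figure can
be sketched in a few lines of Python.  The helper names are made up for
this example, and the 21-day span is an assumption taken from the dates in
the question (18 August to 8 September), over which the two servers
drifted roughly 800 seconds apart.

```python
SECONDS_PER_DAY = 86_400


def ppm_to_seconds_per_day(ppm):
    """Seconds gained or lost per day by a clock that is off by ppm."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def drift_ppm(offset_change_s, elapsed_s):
    """Average frequency error (ppm) implied by an offset change."""
    return offset_change_s / elapsed_s * 1e6


if __name__ == "__main__":
    # 500 ppm is about 43.2 seconds per day.
    print(round(ppm_to_seconds_per_day(500), 1))
    # 800 s over 21 days is roughly 441 ppm of relative drift.
    print(round(drift_ppm(800, 21 * SECONDS_PER_DAY)))
```

So a clock at the 500 ppm limit gains or loses about 43 seconds a day, and
the divergence between the two servers in the question corresponds to a
relative frequency difference of roughly 441 ppm.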

ntp-4.1.2 is well behind the current stable version.  Upgrade and take 
advantage of the fixes and new features.



