[ntp:questions] Problems with NTP with slewing large time offsets

rayvt no at spam.com
Sun Nov 13 18:04:20 UTC 2005


Once our application system is running, we cannot allow the time to step; it 
must only slew.  But due to network and power issues, sometimes ntpdate 
cannot reach the timeserver to step the time before the application comes 
up, and the local time may be off by several seconds.   NTP is designed to 
slew the time with offsets less than 128ms, and step the time with larger 
offsets.  It does not easily handle slewing when the offset is several tens 
of seconds.  With the (old) ntp that came with Redhat, the time offset 
swings wildly positive and negative, overshooting past zero in both 
directions but never coming close to converging even after many days.



If there is a large initial time offset what we would like to happen is for 
the local time to monotonically slew until the offset is near zero and then 
have the kernel ntp_adjtime take over and accurately keep the offset near 
zero.



The version I'm using is ntp-dev-4.2.0b-20051108, on Redhat9 Linux kernel 
2.4.25 with kernel support of ntp_adjtime.  When testing with a starting 
offset of 30 seconds, ntp sometimes works well (slewing at the desired rate 
of 0.5 ms/sec), sometimes slews at a rate 100 times slower, and sometimes 
stubbornly refuses to slew at all.
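
For a sense of scale, here is a back-of-the-envelope sketch of how long a 
pure slew takes at the rates mentioned above.  It is not taken from the ntp 
sources; the names SLEW_RATE and slew_duration are made up for illustration.

```c
/* Desired slew rate quoted above: 0.5 ms of correction per second. */
#define SLEW_RATE 0.0005   /* seconds per second (500 ppm) */

/* Seconds of elapsed time needed to slew away `offset` seconds at a
 * constant rate of `rate` seconds per second.  Hypothetical helper,
 * not an ntpd function. */
double slew_duration(double offset, double rate)
{
    double mag = offset < 0 ? -offset : offset;
    return mag / rate;
}
```

With a 30 second offset this gives 60000 seconds, roughly 16.7 hours.  At a 
rate 100 times slower the same offset needs 6,000,000 seconds, about 69 days.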



The presence or absence of the ntp.drift file does not make any repeatable 
difference.  Rather than using the "-x" option, we have a "tinker step 0" in 
ntp.conf.  This sets the value of clock_max (which is the step vs. slew 
threshold) to something other than the default value of 0.128 sec (128 ms).  
"Tinker step 0" sets clock_max to zero, which means to never step and only 
slew.
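
The step vs. slew decision described above can be modeled roughly as 
follows.  This is a simplified sketch, not the actual ntpd code: the enum, 
the function name, and the exact comparison are assumptions made for 
illustration.

```c
/* Default step threshold in seconds (128 ms), as described in the post. */
#define CLOCK_MAX_DEFAULT 0.128

enum action { ACTION_SLEW, ACTION_STEP };

/* Simplified model of the decision: step only when stepping is enabled
 * (clock_max > 0) and the offset magnitude exceeds the threshold.
 * A clock_max of 0, as set by "tinker step 0", means never step. */
enum action choose_action(double offset, double clock_max)
{
    double mag = offset < 0 ? -offset : offset;

    if (clock_max > 0 && mag > clock_max)
        return ACTION_STEP;
    return ACTION_SLEW;
}
```

In this model a 30 second offset is stepped with the default threshold, but 
slewed once clock_max is 0.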



Some code analysis:

In loop_config/LOOP_DRIFTINIT (ntp_loopfilter.c line#892), if clock_max is > 
0.128, then the kernel support of ntp_adjtime is not initialized.  Since we've 
set clock_max to 0, the kernel support does get initialized, and pll_control 
gets set to 1 to indicate this.



In local_clock (line# 541) (called when a poll from the timeserver is 
received), if pll_control and kern_enable are both set, and the offset is 
less than 0.5 seconds, then ntp_adjtime is called to begin to slew the time. 
In our case, pll_control & kern_enable are both set, but since the offset is 
much larger than 0.5 sec the kernel ntp_adjtime function is not called.



Adj_host_clock (line # 793) is called every second.  If pll_control & 
kern_enable aren't set, then it calls adj_systime to make the kernel do a 
slew.  Otherwise it assumes that ntp_adjtime is being used to slew the time, 
so it returns immediately.



These interactions are the source of our problem.

Tinker step 0 tells it to use the kernel ntp_adjtime support.

Therefore adj_host_clock won't slew the time.

But when the offset is large, local_clock refuses to use ntp_adjtime to 
slew the time, so nothing slews it at all.
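
The interaction can be reduced to a small model.  This is only a sketch of 
the conditions as described in this post, not a copy of ntp_loopfilter.c; 
the struct, the function names, and the KERNEL_LIMIT macro are hypothetical.

```c
#include <stdbool.h>

/* Per the analysis above, local_clock only calls ntp_adjtime when the
 * offset is below 0.5 seconds. */
#define KERNEL_LIMIT 0.5

/* The two flags named in the post. */
struct loop_state {
    bool pll_control;
    bool kern_enable;
};

/* Model of local_clock: true if it would hand the offset to ntp_adjtime. */
bool local_clock_uses_kernel(const struct loop_state *s, double offset)
{
    double mag = offset < 0 ? -offset : offset;
    return s->pll_control && s->kern_enable && mag < KERNEL_LIMIT;
}

/* Model of adj_host_clock: true if it would call adj_systime.  It returns
 * immediately when the kernel discipline is assumed to be in charge. */
bool adj_host_clock_slews(const struct loop_state *s)
{
    return !(s->pll_control && s->kern_enable);
}

/* True when neither path adjusts the clock. */
bool nobody_slews(const struct loop_state *s, double offset)
{
    return !local_clock_uses_kernel(s, offset) && !adj_host_clock_slews(s);
}
```

With both flags set, nobody_slews() is true for a 30 second offset and false 
for a 0.1 second offset, which matches the behavior described above.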



We really want to never step the time, and we really want for ntp_adjtime to 
control the local time.



My experimental fix is to add a flag so that adj_host_clock will slew the 
time if local_clock did not because of the offset being too large.  When the 
offset finally gets small enough, ntp_adjtime will take over.

In testing, even with an initial offset of 30 seconds, the time offset gets 
handled reliably and repeatably.
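
To illustrate the idea only (this is not the patch itself), the flag 
approach can be simulated as below.  Everything here is an assumption made 
for the sketch: the struct, the need_slew flag, the function names, the 
0.5 ms per tick slew amount, and re-evaluating the offset every second 
instead of on every poll.

```c
#include <stdbool.h>

#define KERNEL_LIMIT  0.5      /* sec; below this the kernel path is used */
#define SLEW_PER_TICK 0.0005   /* sec corrected per one-second tick */

struct clock_model {
    double offset;     /* remaining offset in seconds */
    bool   need_slew;  /* hypothetical flag: local_clock skipped the kernel */
};

/* Model of local_clock: raise the flag when the offset is too large for
 * the kernel path, clear it otherwise. */
void model_local_clock(struct clock_model *c)
{
    double mag = c->offset < 0 ? -c->offset : c->offset;
    c->need_slew = (mag >= KERNEL_LIMIT);
}

/* Model of adj_host_clock: slew toward zero only while the flag is set.
 * Returns true if it slewed on this tick. */
bool model_adj_host_clock(struct clock_model *c)
{
    if (!c->need_slew)
        return false;          /* kernel discipline is in control */
    if (c->offset > 0)
        c->offset -= SLEW_PER_TICK;
    else
        c->offset += SLEW_PER_TICK;
    return true;
}

/* Tick once per second until the kernel path can take over, or until
 * max_ticks is reached.  Returns the number of ticks spent slewing. */
long run_until_kernel_takeover(struct clock_model *c, long max_ticks)
{
    long ticks = 0;

    model_local_clock(c);
    while (c->need_slew && ticks < max_ticks) {
        model_adj_host_clock(c);
        ticks++;
        model_local_clock(c);
    }
    return ticks;
}
```

In this model the offset moves monotonically toward zero and the flag clears 
once the offset drops below 0.5 seconds, at which point ntp_adjtime would 
take over.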



Any comments or ideas for other directions to look?  I'll post my patch 
soon.