[ntp:questions] Problems with NTP with slewing large time offsets
no at spam.com
Sun Nov 13 18:04:20 UTC 2005
Once our application system is running, we cannot allow the time to step; it
must only slew. But due to network and power issues, sometimes ntpdate
cannot reach the timeserver to step the time before the application comes
up, and the local time may be off by several seconds. NTP is designed to
slew the time with offsets less than 128ms, and step the time with larger
offsets. It does not easily handle slewing when the offset is several tens
of seconds. With the (old) ntp that came with Redhat, the time offset
swings wildly positive and negative, overshooting past zero in both
directions but never coming close to converging even after many days.
If there is a large initial time offset, what we would like to happen is for
the local time to monotonically slew until the offset is near zero and then
have the kernel ntp_adjtime take over and accurately keep the offset near
zero.
The version I'm using is ntp-dev-4.2.0b-20051108, on Redhat9 Linux kernel
2.4.25 with kernel support of ntp_adjtime. When testing with a starting
offset of 30 seconds, ntp sometimes works well (slewing at the desired rate
of 0.5 ms/sec), sometimes slews at a rate 100 times slower, and sometimes
stubbornly refuses to slew at all.
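To put those rates in perspective, here is a minimal sketch of the arithmetic. The function name slew_duration_s is made up for illustration and is not part of ntpd; it assumes a constant slew rate for the whole correction.

```c
#include <assert.h>

/* Illustrative helper, not an ntpd function.
 * Returns the seconds needed to remove an offset at a constant slew rate.
 * offset_s: clock offset in seconds (the sign is ignored).
 * rate:     seconds of correction per elapsed second,
 *           e.g. 0.0005 for 0.5 ms/sec. */
double slew_duration_s(double offset_s, double rate)
{
    assert(rate > 0.0);
    double magnitude = offset_s < 0.0 ? -offset_s : offset_s;
    return magnitude / rate;
}
```

At 0.5 ms/sec, a 30 second offset takes about 60,000 seconds (roughly 16.7 hours) to slew away. At a rate 100 times slower it would take about 6,000,000 seconds, or roughly 69 days.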
The presence or absence of the ntp.drift file does not make any repeatable
difference. Rather than using the "-x" option, we have a "tinker step 0" in
ntp.conf. This sets the value of clock_max (which is the step vs. slew
threshold) to something other than the default value of 0.128 sec (128 ms).
Tinker step 0 sets clock_max to zero, which means to never step and only slew.
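The threshold rule described above can be sketched as follows. This is a simplified illustration, not the ntpd source: would_step is a hypothetical name, and the real daemon applies additional state and timers before it actually steps.

```c
#include <stdbool.h>

/* Hypothetical helper, not an ntpd function.
 * Returns true if an offset of offset_s seconds exceeds the step
 * threshold clock_max_s. A threshold of zero (as set by
 * "tinker step 0") means never step, only slew. */
bool would_step(double offset_s, double clock_max_s)
{
    double magnitude = offset_s < 0.0 ? -offset_s : offset_s;

    if (clock_max_s <= 0.0)
        return false;               /* stepping disabled */
    return magnitude > clock_max_s; /* default threshold is 0.128 sec */
}
```

With the default threshold of 0.128 sec a 30 second offset would be stepped; with a threshold of zero it never is.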
Some code analysis:
In loop_config/LOOP_DRIFTINIT (ntp_loopfilter.c line 892), if clock_max is >
0.128, then the kernel support of ntp_adjtime is not initialized. Since we've
set clock_max to 0, the kernel support does get initialized. Pll_control gets set to 1, to
In local_clock (line 541) (called when a poll from the timeserver is
received), if pll_control and kern_enable are both set, and the offset is
less than 0.5 seconds, then ntp_adjtime is called to begin to slew the time.
In our case, pll_control & kern_enable are both set, but since the offset is
much larger than 0.5 sec the kernel ntp_adjtime function is not called.
Adj_host_clock (line 793) is called every second. If pll_control &
kern_enable are not both set, then it calls adj_systime to make the kernel do a
slew. Otherwise it assumes that ntp_adjtime is being used to slew the time,
so it returns immediately.
These interactions are the source of our problem.
Tinker step 0 tells it to use the kernel ntp_adjtime support.
Therefore adj_host_clock won't slew the time.
But when the offset is large, local_clock refuses to use ntp_adjtime to
slew the time.
We really want to never step the time, and we really want for ntp_adjtime to
control the local time.
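The interaction described above can be modelled in a few lines. This is a toy model of the behaviour as described in this post, not code from ntp_loopfilter.c; the struct and function names are invented for illustration, and only pll_control, kern_enable and the 0.5 second limit come from the analysis above.

```c
#include <stdbool.h>

/* Toy model of the discipline state (names are illustrative). */
struct clock_model {
    bool   pll_control;  /* kernel discipline requested */
    bool   kern_enable;  /* kernel discipline enabled */
    double offset_s;     /* current offset in seconds */
};

/* Poll path (local_clock as described): the offset is handed to
 * ntp_adjtime only when both flags are set and |offset| < 0.5 sec. */
bool poll_path_slews(const struct clock_model *m)
{
    double magnitude = m->offset_s < 0.0 ? -m->offset_s : m->offset_s;
    return m->pll_control && m->kern_enable && magnitude < 0.5;
}

/* Once-per-second path (adj_host_clock as described): the daemon
 * slews via adj_systime only when the kernel discipline is NOT in use. */
bool tick_path_slews(const struct clock_model *m)
{
    return !(m->pll_control && m->kern_enable);
}

/* True if at least one of the two paths corrects the clock. */
bool anything_slews(const struct clock_model *m)
{
    return poll_path_slews(m) || tick_path_slews(m);
}
```

With both flags set and an offset of 30 seconds, neither path slews, which matches the "refuses to slew at all" symptom.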
My experimental fix is to add a flag so that adj_host_clock will slew the
time if local_clock did not because of the offset being too large. When the
offset finally gets small enough, ntp_adjtime will take over.
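For readers who want the shape of the idea before the patch itself is posted, here is a rough sketch. It is not the actual patch: the flag name kernel_skipped and both function names are hypothetical, and the real change would live inside local_clock and adj_host_clock.

```c
#include <stdbool.h>

/* Illustrative state only; kernel_skipped is a hypothetical flag. */
struct discipline {
    bool pll_control;
    bool kern_enable;
    bool kernel_skipped;  /* set when the poll path declined ntp_adjtime */
};

/* Poll path: record whether the kernel discipline was skipped
 * because the offset was 0.5 sec or more. */
void on_poll(struct discipline *d, double offset_s)
{
    double magnitude = offset_s < 0.0 ? -offset_s : offset_s;
    bool kernel_in_use = d->pll_control && d->kern_enable;

    d->kernel_skipped = kernel_in_use && magnitude >= 0.5;
}

/* Once-per-second path: slew in the daemon if the kernel discipline
 * is not in use, or if the poll path skipped it for a large offset. */
bool tick_should_slew(const struct discipline *d)
{
    if (!(d->pll_control && d->kern_enable))
        return true;
    return d->kernel_skipped;
}
```

Once the offset drops below 0.5 seconds, on_poll clears the flag and the once-per-second path goes back to returning immediately, leaving the clock to ntp_adjtime.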
In testing, even with an initial offset of 30 seconds, the time offset gets
handled reliably and repeatably.
Any comments or ideas for other directions to look? I'll post my patch