[ntp:questions] Re: tinker step 0 (always slew) and kernel time discipline

Richard B. Gilbert rgilbert88 at comcast.net
Fri Sep 22 20:55:20 UTC 2006

Joe Harvell wrote:

> Richard B. Gilbert wrote:
>> Joe Harvell wrote:
>>> David L. Mills wrote:
>>> <snip>
> <snip>
>>> I concede that only having 2 NTP servers for our host made this 
>>> problem more likely to occur.  But considering the mayhem caused by 
>>> jerking the clock back and forth every 15 minutes for 22 days, I think 
>>> it is worth investigating whether to eliminate stepping altogether.
>> Why didn't anyone notice the problem for 22 days?  If, indeed, it 
>> caused mayhem, why was it allowed to continue for so long?
> I see your point.  I don't know for sure if it really caused problems.  
> I suspect I will begin to see a large number of bug reports coming out 
> of this test lab once they start filtering back to the design team.  But 
> it is quite possible there weren't any big problems or they went 
> unnoticed.  It really depends on the type of testing they were 
> performing.  This application is a call processing application, 
> implementing call signaling protocols, and a host of other proprietary 
> protocols for OAM (Operations, Administration, Maintenance) of the 
> software itself.  The big problems I would expect to have occurred fall 
> into two categories:  1) problems stemming from protocol timers expiring 
> both early and late; and 2) accounting records for the calls themselves 
> showing inaccurate (including negative) duration.  The software that did 
> notice the problem was the software responsible for journaling 
> application state from one process to another, as part of a 1+1 fault 
> tolerance system.  This software was measuring round-trip latencies 
> between itself and its mate by bouncing a measurement from its own clock 
> off of its mate and then re-sampling its own clock to see the RTT.  These 
> RTT measurements only take place during failure recovery scenarios, which 
> is what was being tested at the time.
> Since our customers are telecommunications service providers, I expect 
> they would notice negative durations for their billing records.  I am 
> trying to prevent this from ever occurring.  However, based on the 
> response I've received from Dr. Mills in this thread, it seems like the 
> daemon feedback loop is unstable as a result of OS developers 
> implementing variable slew rates into adjtime.  So it looks like if we 
> continue with NTP, the better choice is to use the kernel time 
> discipline for stable time.  We will have to engineer the network so 
> that multiple failures would be required to necessitate the stepping in 
> the first place.
> I wonder if it would be good to add a description to the NTP FAQ about 
> this?  The key points to include I think should be why the kernel time 
> discipline is disabled when the step threshold is changed, and also some 
> indication that the daemon feedback loop is broken to begin with.  I am 
> not the first person to go down this path.
> Thanks again for your responses.
> ---
> Joe Harvell

The telephone companies tend to be very aware of time and timing.  The 
time division multiplexing of T1 and T3 lines requires splitting the 
second very precisely.  Cellular phones also require very precise 
timing.  I would expect them to be able to provide your application with 
an ultra accurate and ultra stable time signal of some sort.

I think you are okay as long as the customer knows that the time must 
NOT step.
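
For the interval measurements specifically (the RTT between mates, call 
durations), one possible safeguard is to time them against a monotonic 
clock instead of the wall clock, so that a step cannot produce a negative 
result.  The following is only a rough sketch in Python, not anything ntpd 
provides; `bounce_off_mate` is a hypothetical stand-in for the real 
exchange with the mate, and `wall_clock_duration` is there only to show 
how the negative value arises.

```python
import time


def measure_rtt(bounce_off_mate):
    """Return the round-trip time in seconds, using a monotonic clock.

    bounce_off_mate is a hypothetical callable standing in for the real
    request/response exchange with the mate process.
    """
    start = time.monotonic()  # cannot go backward, unaffected by clock steps
    bounce_off_mate()
    return time.monotonic() - start  # never negative


def wall_clock_duration(start_wall, end_wall):
    """Duration from two wall-clock timestamps.

    The result is negative if the clock was stepped backward between
    the two samples.
    """
    return end_wall - start_wall


if __name__ == "__main__":
    # A short sleep stands in for the network round trip (assumption).
    rtt = measure_rtt(lambda: time.sleep(0.01))
    print(f"monotonic RTT: {rtt:.4f} s")

    # Two made-up wall-clock samples with a backward step in between.
    print(f"wall-clock duration: {wall_clock_duration(100.0, 99.5)} s")
```

Billing records would still need wall-clock timestamps for when a call 
started; the idea is only that the duration itself could be taken from 
the monotonic clock.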

Agree about the documentation; there are many things that are not as 
well documented as they might be.  OTOH, as long as you configure and 
operate ntpd as designed, it tends to work very well.
