[ntp:questions] Re: tinker step 0 (always slew) and kernel time discipline

Joe Harvell harvell at nortel.com
Fri Sep 22 20:11:19 UTC 2006


Richard B. Gilbert wrote:
> Joe Harvell wrote:
> 
>> David L. Mills wrote:
>> <snip>
<snip>
>> I concede that only having 2 NTP servers for our host made this 
>> problem more likely to occur.  But considering the mayhem caused by 
>> jerking the clock back and forth every 15 minutes for 22 days, I think 
>> it is worth investigating whether to eliminate stepping altogether.
>>
> 
> Why didn't anyone notice the problem for 22 days?  If, indeed, it caused 
> mayhem, why was it allowed to continue for so long?

I see your point.  I don't know for sure if it really caused problems.  I suspect I will begin to see a large number of bug reports coming out of this test lab once they start filtering back to the design team.  But it is quite possible there weren't any big problems or they went unnoticed.  It really depends on the type of testing they were performing.

This application is a call processing application, implementing call signaling protocols and a host of other proprietary protocols for OAM (Operations, Administration, Maintenance) of the software itself.  The big problems I would expect to have occurred fall into two categories:  1) problems stemming from protocol timers expiring both early and late; and 2) accounting records for the calls themselves showing inaccurate (including negative) durations.

The software that did notice the problem was the software responsible for journaling application state from one process to another, as part of a 1+1 fault tolerance system.  This software was measuring round-trip latencies between itself and its mate by bouncing a sample of its own clock off of its mate and then re-sampling its own clock to compute the RTT.  These RTT measurements only take place during failure recovery scenarios, which is what was being tested at the time.
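
To make the failure mode concrete, here is a minimal sketch (not our actual journaling code) of an RTT measurement that samples a clock before and after a round trip.  SteppedWallClock and round_trip_with_backward_step are hypothetical names made up for this illustration; the class just simulates a wall clock that gets stepped back by 500 ms in the middle of a measurement.  The same measurement taken against a monotonic clock is not affected by the step.

```python
import time


def measure_rtt(clock, round_trip):
    """Sample the clock, perform the round trip, then sample the clock again."""
    start = clock()
    round_trip()
    return clock() - start


class SteppedWallClock:
    """Hypothetical wall clock that can be stepped; for illustration only."""

    def __init__(self):
        self.offset = 0.0

    def step(self, seconds):
        # Simulates a step correction being applied to the wall clock.
        self.offset += seconds

    def __call__(self):
        return time.monotonic() + self.offset


wall = SteppedWallClock()


def round_trip_with_backward_step():
    time.sleep(0.01)   # stand-in for bouncing a message off the mate
    wall.step(-0.5)    # wall clock is stepped back 500 ms mid-measurement


# Measured against the stepped wall clock, the RTT comes out negative.
wall_rtt = measure_rtt(wall, round_trip_with_backward_step)

# Measured against a monotonic clock, the RTT stays positive.
mono_rtt = measure_rtt(time.monotonic, round_trip_with_backward_step)

print(f"wall-clock RTT: {wall_rtt:+.3f} s")
print(f"monotonic RTT:  {mono_rtt:+.3f} s")
```

The same reasoning applies to call durations and protocol timers: any interval computed from two wall-clock samples can come out wrong, or negative, if a step lands between them.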

Since our customers are telecommunications service providers, I expect they would notice negative durations in their billing records.  I am trying to prevent this from ever occurring.  However, based on the response I've received from Dr. Mills in this thread, it seems like the daemon feedback loop is unstable as a result of OS developers implementing variable slew rates in adjtime.  So it looks like if we continue with NTP, the better choice is to use the kernel time discipline for stable time.  We will have to engineer the network so that multiple failures would be required before stepping becomes necessary in the first place.
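
For anyone following along, the two choices look roughly like this in ntp.conf.  This is only a sketch of my understanding from this thread; the server names are placeholders, not our real configuration.

```
# Option A: leave the step threshold at its default (128 ms).
# The kernel time discipline remains in use, but ntpd may still
# step the clock if the offset exceeds the threshold.
server ntp1.example.com
server ntp2.example.com
server ntp3.example.com

# Option B: never step, always slew.
# As discussed in this thread, changing the step threshold this way
# disables the kernel time discipline, leaving the daemon feedback loop.
# tinker step 0
```

Option A with more than two servers is the direction described above: keep the kernel discipline and make a step unlikely rather than forbidding it.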

I wonder if it would be good to add a description of this to the NTP FAQ?  I think the key points to include are why the kernel time discipline is disabled when the step threshold is changed, and some indication that the daemon feedback loop is broken to begin with.  I am not the first person to go down this path.

Thanks again for your responses.

---
Joe Harvell



