[ntp:questions] Re: tinker step 0 (always slew) and kernel time discipline

user at domain.invalid
Mon Sep 25 18:22:29 UTC 2006


Joe,

First of all, you misunderstand what the prefer keyword is for; it is 
not applicable to your scenario. As for jerking back and forth every 15 
minutes, something is seriously broken with the hardware, either a stuck 
bit or a kernel problem. Consider a step action a temporal canary: given 
the rather large number of servers around here and at the national labs, 
if a step ever occurs, the hardware is to blame.

Second, you apparently are using two servers whose times diverge 
widely. The clients will be thoroughly confused as to which of the 
servers is trustworthy. This is not a step problem; it is a fatal 
condition for the applications. If the divergence is due to configuring 
both servers with the local clock driver, this violates the principle 
that all servers cling to the same timescale, whether UTC or synthetic. 
If you really need redundant servers that cling to the same synthetic 
timescale, configure both servers in orphan mode and in symmetric active 
mode with each other. Do not use the local clock driver.
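
A minimal sketch of that configuration (the addresses are made up; each 
server names the other as a symmetric active peer, and both agree on the 
same orphan stratum):

```
# /etc/ntp.conf on server A (192.0.2.1; its partner B is 192.0.2.2)
tos orphan 6                # fall back to orphan mode at stratum 6 if all sources fail
peer 192.0.2.2              # symmetric active association with server B
driftfile /var/lib/ntp/ntp.drift

# On server B the peer line points back the other way:
# peer 192.0.2.1
```

With this arrangement the survivor of a partition keeps serving a common 
synthetic timescale rather than each machine free-running on its own 
local clock driver.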

A better choice is to have three servers configured as above. If one of 
them sails off into the sunset, a majority clique is still possible. 
With only two servers, if one of them sails away, the clients cannot 
form a majority clique and will conclude that neither of them is sane.

Above all, if you are serious about the integrity of the time function 
and believe in Lamport's happens-before relation as interpreted by NTP, 
take very seriously the topics discussed in the white papers linked from 
the NTP project page. Also, there is no excuse for failing to detect a 
scenario in which the servers seriously disagree and to report it to 
your beeper. That is how the NIST servers are monitored.
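
A beeper check along those lines can be sketched as follows. This is an 
illustrative sketch, not ntpd code: the offsets would come from polling 
each server (with ntpq or similar), and the 0.128 s threshold simply 
mirrors ntpd's default 128 ms step threshold.

```python
def servers_diverge(offsets, threshold=0.128):
    """Return True if the reported server offsets (in seconds) spread
    wider than `threshold` -- i.e. the servers cannot all be tracking
    the same timescale to within ntpd's default 128 ms step threshold."""
    return max(offsets) - min(offsets) > threshold

# Two servers roughly 800 s apart, as in the testbed report quoted
# below: time to page the operator.
if servers_diverge([0.002, 800.0]):
    print("ALARM: configured servers disagree; check hardware and config")
```

The point is not the code but the policy: divergence between configured 
servers should trip an alarm long before the clients start stepping.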

Dave

Joe Harvell wrote:
> David L. Mills wrote:
> <snip>
> 
>>
>> 5. If for some reason the server(s) are not reachable at startup and 
>> the applications must start, then I would assume the applications 
>> would fail, since the time is not synchronized. If the applications 
>> use the NTP system primitives, the synchronization condition is 
>> readily apparent in the return code. Since they can't run anyway, 
>> there is no harm in stepping the clock, no matter what the initial 
>> offset. Forcing a slew in this case would seem highly undesirable, 
>> unless the application can tolerate large differences between clocks 
>> and, in that case, using ntpd is probably a poor choice in the first 
>> place.
>>
> 
> I agree that the condition of no time servers reachable on startup is 
> the most common case where a large offset will eventually be observed.  
> I agree that the application should detect this and fail before starting 
> up.  I am concerned about clock and network failure scenarios that cause 
> an NTP client to see two different NTP servers with very different times.
> 
> This actually happened in a testbed for our application. NTP stats show 
> that over the course of 22 days, the offsets of two configured NTP 
> servers (both ours) serving one of our NTP clients started diverging up 
> to a maximum distance of 800 seconds.  During this time, our NTP client 
> stepped its clock forward 940 times and backwards 803 times, with 
> increasing magnitudes up to ~400 seconds.  The problem went away when 
> someone "added an IP address to the configuration of one of the NTP 
> servers."  (I am still trying to determine exactly what happened).  The 
> ntp.conf files of the NTP client, the stats, and a nice graph of the 
> offsets are found at http://dingo.dogpad.net/ntpProblem/.
> 
> I concede that only having 2 NTP servers for our host made this problem 
> more likely to occur.  But considering the mayhem caused by jerking the 
> clock back and forth every 15 minutes for 22 days, I think it is worth 
> investigating whether to eliminate stepping altogether.
> 
> I still don't understand why the clock was being stepped back and 
> forth.  One of the NTP servers showed up with 80f4 (unreachable) status 
> every 15 minutes for the entire 22 days, but with 90f4 (reject) and 96f4 
> (sys.peer) in between.  Oddly, of the two servers, the *other* one was 
> the preferred peer.  I wonder why this peer would ever be selected as 
> the sys.peer, since the prefer peer was only reported unreachable 10 
> times over this 22-day period.  Would this be because the 
> selection algorithm finds no intersection?
> 
> Maybe the behavior I saw was a bug, and not the expected consequence of 
> a failure scenario in which 2 NTP servers have diverging clocks.
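
On the selection question Joe raises: a simplified sketch of the 
Marzullo-style intersection step that ntpd's selection algorithm is 
based on may help show why two widely divergent servers leave no 
majority. Real ntpd forms correctness intervals from offset plus or 
minus root distance and adds further refinements; this toy version only 
counts overlapping intervals.

```python
def marzullo(intervals):
    """Given [lo, hi] correctness intervals, find the sub-interval
    contained in the largest number of them (Marzullo-style).
    Returns (best_lo, best_hi, count)."""
    # Edge list: +1 opens an interval, -1 closes it.  At equal values,
    # opens sort before closes so touching intervals count as overlapping.
    edges = sorted([(lo, +1) for lo, hi in intervals] +
                   [(hi, -1) for lo, hi in intervals],
                   key=lambda e: (e[0], -e[1]))
    best = cnt = 0
    best_lo = best_hi = None
    for i, (value, delta) in enumerate(edges):
        cnt += delta
        if cnt > best:
            best = cnt
            best_lo = value
            # A new maximum always occurs at an open edge; the overlap
            # region extends to the very next edge.
            best_hi = edges[i + 1][0]
    return best_lo, best_hi, best

# Three sane servers: a majority clique of two overlaps on [1, 2].
print(marzullo([(0, 2), (1, 3), (5, 6)]))        # (1, 2, 2)

# Two servers ~800 s apart: the best overlap contains only one interval,
# so there is no majority and neither server can be trusted.
print(marzullo([(-0.1, 0.1), (799.9, 800.1)]))   # count of 1
```

With only two sources and no intersection, the selection algorithm has 
no basis for declaring either one a truechimer, which is consistent with 
the reject/sys.peer flapping in the report above.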
