[ntp:hackers] What to do when the offset is WAYTOOBIG

Brian Utterback brian.utterback at sun.com
Fri Apr 20 04:36:13 PDT 2007

What you describe is exactly what I am looking for. All too often the
program exits with only that single message in the syslog, which is
overlooked and, worse, not even available remotely.

I am concerned by your use of the phrase "dead in the water 
configuration rewrite code". During recent discussions, I was given
to understand that the config rewrite code was likely to be integrated
fairly soon. Is that no longer the case?

David L. Mills wrote:

> Most often, the operator is in an office down the corridor and can ssh 
> to the victim machine and fix the problem. A really neat way to do this 
> is using the remote configuration features of the now dead in the water 
> configuration rewrite code. This could be used both to monitor operation 
> in real time and to send authenticated configuration commands in 
> real time.
> While not volunteering to do this myself, I propose a new option that, 
> if present, avoids killing the process and instead calls a designated 
> program. Also, a trap should be provided to alert the ntpq monitoring 
> program which itself can call a designated program if the performance is 
> out of bounds.
> All this of course should eventually be done using SNMP infrastructure; 
> however, upon closer inspection, SNMP is woefully inadequate with the 
> data types and trip wires commonly used by ntpd.
> Dave
> Brian Utterback wrote:
>> I agree with everything you said. As I said in my previous message,
>> you have convinced me that trying to monkey with the clock selection
>> is the wrong way to go, despite the fact that it has some nice
>> properties. So I agree that the only thing left is the decision to
>> step or fall on your sword. And as you note, the step threshold is
>> configurable, so there is nothing to discuss if you want to step.
>> However, I think that there is one more item on the agenda, namely
>> whether or not it would be a good idea to fall on your sword or
>> merely run and hide and call for help.
>> So, my proposal is that instead of exiting with an error message,
>> we do not step the clock, we do print an error message, and we
>> mark the clock as insane (or otherwise stop sending out the time).
>> We should ensure that there is a way to remotely set the allow_panic
>> variable if the admin decides that is the way to fix things. I am
>> happy to make the choice about whether or not to exit configurable
>> as well.
>> So, Dave, I guess the question I have is: is there a real technical reason
>> why this is a bad idea? Is there a scenario where the behavior I
>> am proposing behaves worse than the current behavior? We know that
>> there are some common scenarios where it has more desirable behavior.
>> David L. Mills wrote:
>>> Brian,
>>> There are two separable issues to consider here, the clock selection 
>>> algorithm and the panic threshold.
>>> The clock selection algorithm was designed only after considerable 
>>> discussion, both in the commercial community (DEC) and in the computer 
>>> science theory community. As has been noted, it is based on the 
>>> model described in Keith Marzullo's dissertation and discussions at 
>>> the Dagstuhl Conference in southern Germany. I say this to emphasize 
>>> this model has been extensively vetted by a bunch of guys I trust.
>>> The absolute bedrock method in the design is to find the best 
>>> majority subset of clocks that agree within some interval based on 
>>> delay. There is strong theoretical and practical evidence that the 
>>> true UTC is somewhere in the middle of what I call the intersection 
>>> interval developed by the algorithm. In my example, the conditions 
>>> are the same before the clock is set at 1200 and after the clock has 
>>> been set manually. The three 1200 clocks remain the truechimers and 
>>> the two 1300 clocks remain the falsetickers. The fact the 
>>> falsetickers are above or below the panic threshold is not 
>>> significant. The only time the panic threshold is important is at the 
>>> time the clock is to be set.
>>> Notwithstanding the above, the important issue is whether to step the 
>>> clock, wait for better times or call for (presumably) human 
>>> intervention. I purposely chose a scenario where it was necessary to 
>>> choose between two alternative cliques, the members of which were 
>>> close together while the cliques themselves were far apart. It could 
>>> be the nearest clique is within the panic threshold, but the 
>>> selection algorithm considers that clique falseticker by the rules. 
>>> Thus, the only thing remaining is whether to step to the truechimer 
>>> clique or panic.
>>> In my example there is no credible way the two cliques come together 
>>> left by themselves unless one of them is manually stepped. Should 
>>> either clique be rescued, it joins the other clique and things get 
>>> well. So, the only question remaining is when the panic threshold is 
>>> exceeded, whether to fall on your sword or step. I submit this is a 
>>> nonstarter to argue. You get to set the panic interval anywhere you 
>>> like, perhaps 30 years might be appropriate.
>>> Dave
>>> Brian Utterback wrote:
>>>> Actually, your scenario is a good reason why it may not be a good
>>>> idea to mark clocks that are outside the limit as insane and ignore
>>>> them. If we were to ignore the three that say 1200, then we would only
>>>> have the two that say 1300, right at the limit. So we step the clock to
>>>> 1300. Now we have all five available and since three say 1200, we step
>>>> the clock to 1200, effectively circumventing the panic limit.
>>>> All I am saying is that if you exit there is only one recourse, to
>>>> manually restart. The problem could be permanent, the problem could
>>>> be transient. In either case, somebody needs to log on the system and
>>>> restart the daemon.
>>>> On the other hand, if you instead stop serving time but don't exit,
>>>> then if the problem is transient then no intervention is required. If
>>>> it is permanent but fixable upstream, again no intervention is
>>>> required. If it is permanent and local (I'm thinking somebody set the
>>>> local clock) then it might be fixed by resetting allow_panic (can that
>>>> be done remotely? With the new config stuff?). And finally, it might
>>>> still require a local login, but that would have happened either way.
>>>> No matter how I slice it, it seems better to me to stay alive and
>>>> hopeful even if those hopes are dashed, than to commit suicide. If
>>>> you stop serving the time downstream, then the effect on the NTP
>>>> network is the same either way, but by staying alive you can allow
>>>> remote diagnosis and keep calling for help periodically.
>>>> David L. Mills wrote:
>>>>> Brian,
>>>>> I am watching five clocks. Three of them say 1200, two say 1300 and 
>>>>> my clock says 1400. Since the majority of clocks I watch say 1200, 
>>>>> I conclude the real time is 1200, but that is beyond my panic limit 
>>>>> of one hour. Should I wait until things "get better"? I think not. 
>>>>> I could make the panic limit over two hours and things would get 
>>>>> better real quick. Or, I could use the -g option, so the first 
>>>>> panic would be forgiven and my clock would read 1200. If after that 
>>>>> a warp occurs over 1000 s relative to the majority clique, there 
>>>>> may be a stuck bit in the hardware clock (that's happened) and I 
>>>>> need to jump the train right away.
>>>>> Dave
>>>>> Brian Utterback wrote:
>>>>>> But is this a valid characterization? And even if it is mostly
>>>>>> true, what harm is there in waiting to see if it gets better? I
>>>>>> think Judah has the right idea, namely if the going gets tough,
>>>>>> just sit down, shut up and pretend that you don't exist until
>>>>>> things get better. That is, go ahead and yell, don't step the clock
>>>>>> but don't serve time in case you might be off, but be willing to
>>>>>> start up again if things get better later. This seems like the best
>>>>>> of both worlds.
>>>>>> David L. Mills wrote:
>>>>>>> The philosophical basis of this design is very carefully 
>>>>>>> considered in the book. However, the simple characterization of 
>>>>>>> the panic threshold is that if exceeded, it will not get better 
>>>>>>> no matter how long you wait.
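
For reference, the five-clock scenario argued over in this thread can be
written down as a small Python sketch. It is an illustration only, under
loud simplifying assumptions: clocks are compared by exact reading rather
than by the intersection interval that ntpd really uses, the step threshold
is ignored, and every name here (decide, within_limit, exit_on_panic) is
hypothetical and not taken from the ntpd source.

```python
from collections import Counter

# One-hour panic limit, as in the example in this thread (assumption:
# ntpd's own default is 1000 s, not one hour).
PANIC_THRESHOLD_S = 3600


def hhmm_to_seconds(hhmm):
    """Convert a clock reading such as 1200 (12:00) to seconds since midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 3600 + minutes * 60


def offset_s(local, reading):
    """Absolute difference between two HHMM readings, in seconds."""
    return abs(hhmm_to_seconds(reading) - hhmm_to_seconds(local))


def majority_reading(readings):
    """Return the reading shared by a strict majority of sources, else None.

    This is a toy stand-in for the real clock selection algorithm.
    """
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) / 2 else None


def within_limit(local, readings):
    """Keep only the sources whose offset does not exceed the panic limit."""
    return [r for r in readings if offset_s(local, r) <= PANIC_THRESHOLD_S]


def decide(local, readings, exit_on_panic=True):
    """Return (action, target) for one update.

    exit_on_panic=True models the current behavior (fall on your sword);
    exit_on_panic=False models the proposal (stay alive, do not step,
    stop serving time, and wait for things to get better).
    """
    target = majority_reading(readings)
    if target is None:
        return ("hold", None)
    if offset_s(local, target) == 0:
        return ("ok", target)
    if offset_s(local, target) > PANIC_THRESHOLD_S:
        return ("exit" if exit_on_panic else "hold", target)
    return ("step", target)


sources = [1200, 1200, 1200, 1300, 1300]

# Current behavior versus the proposal, with the local clock at 1400.
print(decide(1400, sources))
print(decide(1400, sources, exit_on_panic=False))

# Ignoring the out-of-limit sources first circumvents the panic limit:
# step to 1300 using the two remaining clocks, then to 1200 using all five.
first = decide(1400, within_limit(1400, sources))
second = decide(first[1], sources)
print(first, second)
```

With the local clock at 1400 the majority reading is 1200, two hours away,
so the first call panics and the second holds. Filtering out the
out-of-limit sources first yields two successive one-hour steps, which is
the circumvention of the panic limit described earlier in the thread.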


"Remember 'A Thousand Points of Light'? With a network, we now have
a thousand points of failure."
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
