[ntp:hackers] What to do when the offset is WAYTOOBIG

Brian Utterback brian.utterback at sun.com
Thu Apr 19 11:21:24 PDT 2007


I agree with everything you said. As I said in my previous message,
you have convinced me that trying to monkey with the clock selection
is the wrong way to go, despite the fact that it has some nice
properties. So I agree that the only thing left is the decision to
step or fall on your sword. And as you note, the step threshold is
configurable, so there is nothing to discuss if you want to step.
However, I think that there is one more item on the agenda, namely
whether or not it would be a good idea to fall on your sword or
merely run and hide and call for help.

So, my proposal is that instead of exiting with an error message,
we do not step the clock, we do print an error message, and we
mark the clock as insane (or otherwise stop sending out the time).
We should ensure that there is a way to remotely set the allow_panic
variable if the admin decides that is the way to fix things. I am
happy to make the choice about whether or not to exit configurable
as well.
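
To make the proposal concrete, here is a minimal sketch of the decision it
describes. It is an illustration only, not ntpd's code: the function name,
the Action values and the exit_on_panic flag are hypothetical, and the
1000 s default is simply the panic threshold mentioned later in this thread.

```python
from enum import Enum


class Action(Enum):
    NORMAL = "normal"  # offset within the panic threshold: carry on as usual
    STEP = "step"      # panic forgiven once: step the clock
    EXIT = "exit"      # current behavior: log an error and exit
    HOLD = "hold"      # proposed: log an error, stay up, stop serving time


def on_offset(offset_s, panic_s=1000.0, allow_panic=False, exit_on_panic=False):
    """Return (action, serve_time, allow_panic) for one measured offset.

    All names here are assumptions made for the sketch.
    """
    if abs(offset_s) <= panic_s:
        return Action.NORMAL, True, allow_panic
    if allow_panic:
        # An admin (possibly remotely) has allowed one large step.
        return Action.STEP, True, False
    if exit_on_panic:
        return Action.EXIT, False, False
    # Do not step, do not exit, and do not serve time while in this state.
    return Action.HOLD, False, False
```

With the defaults, on_offset(7200.0) yields HOLD with serving disabled; the
same offset with allow_panic=True yields a single STEP and clears the flag.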

So, Dave, I guess the question I have is: is there a real technical reason
why this is a bad idea? Is there a scenario where the behavior I
am proposing behaves worse than the current behavior? We know that
there are some common scenarios where it has more desirable behavior.
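
The five-clock example quoted further down (three sources at 1200, two at
1300, local clock at 1400, panic limit of one hour) can be worked through
with a toy model. This is a rough sketch under loud assumptions: sources
report exact values and "majority" means a strict majority of identical
readings, whereas the real selection algorithm intersects intervals based
on delay. The function names are hypothetical.

```python
from collections import Counter

HOUR = 3600  # seconds


def majority_time(sources):
    """Return the value reported by a strict majority of sources, else None."""
    if not sources:
        return None
    value, count = Counter(sources).most_common(1)[0]
    return value if count > len(sources) / 2 else None


def filtered_steps(local, sources, limit, max_rounds=10):
    """Hypothetical policy: ignore sources beyond the limit, step to the rest.

    Returns the sequence of times the local clock would be stepped to.
    """
    steps = []
    for _ in range(max_rounds):  # bounded so the toy model cannot loop forever
        usable = [s for s in sources if abs(s - local) <= limit]
        target = majority_time(usable)
        if target is None or target == local:
            break
        steps.append(target)
        local = target
    return steps


sources = [12 * HOUR] * 3 + [13 * HOUR] * 2
local = 14 * HOUR

# Using all five sources, the majority says 1200, two hours from the local
# clock, so a one-hour panic limit is exceeded.
target = majority_time(sources)
exceeds_limit = abs(target - local) > HOUR

# Ignoring out-of-limit sources instead leads to two steps, 1300 then 1200,
# which ends up two hours away despite the one-hour limit.
steps = filtered_steps(local, sources, HOUR)
```

In this model the filtered policy ends at the same 1200 the majority wanted
all along, which is the circumvention of the panic limit described in the
quoted text below.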

David L. Mills wrote:
> Brian,
> 
> There are two separable issues to consider here, the clock selection 
> algorithm and the panic threshold.
> 
> The clock selection algorithm was designed only after considerable 
> discussion, both in the commercial community (DEC) and in the computer 
> science theory community. As has been noted, it is based on the model 
> described in Keith Marzullo's dissertation and discussions at the 
> Dagstuhl Conference in southern Germany. I say this to emphasize this 
> model has been extensively vetted by a bunch of guys I trust.
> 
> The absolute bedrock method in the design is to find the best majority 
> subset of clocks that agree within some interval based on delay. There 
> is strong theoretical and practical evidence that the true UTC is 
> somewhere in the middle of what I call the intersection interval 
> developed by the algorithm. In my example, the conditions are the same 
> before the clock is set at 1200 and after the clock has been set 
> manually. The three 1200 clocks remain the truechimers and the two 1300 
> clocks remain the falsetickers. The fact the falsetickers are above or 
> below the panic threshold is not significant. The only time the panic 
> threshold is important is at the time the clock is to be set.
> 
> Notwithstanding the above, the important issue is whether to step the 
> clock, wait for better times or call for (presumably) human 
> intervention. I purposely chose a scenario where it was necessary to 
> choose between two alternative cliques, the members of which were close 
> together while the cliques themselves were far apart. It could be the 
> nearest clique is within the panic threshold, but the selection 
> algorithm considers that clique falseticker by the rules. Thus, the only 
> thing remaining is whether to step to the truechimer clique or panic.
> 
> In my example there is no credible way the two cliques come together 
> left by themselves unless one of them is manually stepped. Should either 
> clique be rescued, it joins the other clique and things get well. So, 
> the only question remaining is when the panic threshold is exceeded, 
> whether to fall on your sword or step. I submit this is a nonstarter to 
> argue. You get to set the panic interval anywhere you like, perhaps 30 
> years might be appropriate.
> 
> Dave
> 
> Brian Utterback wrote:
>> Actually, your scenario is a good reason why it may not be a good
>> idea to mark clocks that are outside the limit as insane and ignore
>> them. If we were to ignore the three that say 1200, then we would only
>> have the two that say 1300, right at the limit. So we step the clock to
>> 1300. Now we have all five available and since three say 1200, we step
>> the clock to 1200, effectively circumventing the panic limit.
>>
>> All I am saying is that if you exit there is only one recourse, to
>> manually restart. The problem could be permanent, the problem could
>> be transient. In either case, somebody needs to log on the system and
>> restart the daemon.
>>
>> On the other hand, if you instead stop serving time but don't exit,
>> then if the problem is transient then no intervention is required. If
>> it is permanent but fixable upstream, again no intervention is required.
>> If it is permanent and local (I'm thinking somebody set the local clock)
>> then it might be fixed by resetting allow_panic (can that be done 
>> remotely? With the new config stuff?). And finally, it might still
>> require a local login, but that would have happened either way.
>>
>> No matter how I slice it, it seems better to me to stay alive and
>> hopeful even if those hopes are dashed, than to commit suicide. If
>> you stop serving the time downstream, then the effect on the NTP
>> network is the same either way, but by staying alive you can allow
>> remote diagnosis and keep calling for help periodically.
>>
>> David L. Mills wrote:
>>
>>> Brian,
>>>
>>> I am watching five clocks. Three of them say 1200, two say 1300 and my 
>>> clock says 1400. Since the majority of clocks I watch say 1200, I 
>>> conclude the real time is 1200, but that is beyond my panic limit of 
>>> one hour. Should I wait until things "get better"? I think not. I 
>>> could make the panic limit over two hours and things would get better 
>>> real quick. Or, I could use the -g option, so the first panic would 
>>> be forgiven and my clock would read 1200. If after that a warp occurs 
>>> over 1000 s relative to the majority clique, there may be a stuck bit 
>>> in the hardware clock (that's happened) and I need to jump the train 
>>> right away.
>>>
>>> Dave
>>>
>>> Brian Utterback wrote:
>>>
>>>> But is this a valid characterization? And even if it is mostly true, 
>>>> what harm is there in waiting to see if it gets better? I think 
>>>> Judah has the right idea, namely if the going gets tough, just sit 
>>>> down, shut up and pretend that you don't exist until things get
>>>> better. That is,
>>>> go ahead and yell, don't step the clock but don't serve time in case
>>>> you might be off, but be willing to start up again if things get better
>>>> later. This seems like the best of both worlds.
>>>>
>>>> David L. Mills wrote:
>>>>
>>>>>
>>>>> The philosophical basis of this design is very carefully considered 
>>>>> in the book. However, the simple characterization of the panic 
>>>>> threshold is that if exceeded, it will not get better no matter how 
>>>>> long you wait.
>>>>
>>>>
>>>>
>>>>
>>>
>>

-- 
blu

"Remember 'A Thousand Points of Light'? With a network, we now have
a thousand points of failure."
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom

