[ntp:hackers] What to do when the offset is WAYTOOBIG

David L. Mills mills at udel.edu
Thu Apr 19 10:59:18 PDT 2007


There are two separable issues to consider here: the clock selection 
algorithm and the panic threshold.

The clock selection algorithm was designed only after considerable 
discussion, both in the commercial community (DEC) and in the computer 
science theory community. As has been noted, it is based on the model 
described in Keith Marzullo's dissertation and discussions at the 
Dagstuhl Conference in southern Germany. I say this to emphasize that 
this model has been extensively vetted by a bunch of guys I trust.

The absolute bedrock method in the design is to find the best majority 
subset of clocks that agree within some interval based on delay. There 
is strong theoretical and practical evidence that the true UTC is 
somewhere in the middle of what I call the intersection interval 
developed by the algorithm. In my example, the conditions are the same 
before the clock is set at 1200 and after the clock has been set 
manually. The three 1200 clocks remain the truechimers and the two 1300 
clocks remain the falsetickers. The fact that the falsetickers are above 
or below the panic threshold is not significant. The only time the panic 
threshold is important is at the time the clock is to be set.
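
To make the intersection idea concrete, here is a minimal sketch in
Python of a Marzullo-style interval intersection applied to the
five-clock example. It is a simplification for illustration, not the
ntpd code: the function names and the 0.05 s half-widths are
assumptions, and offsets are in seconds relative to a local clock that
reads 1400.

```python
from typing import List, Optional, Tuple

Clock = Tuple[float, float]  # (offset_s, half_width_s); half-width stands in for delay


def intersection_interval(clocks: List[Clock]) -> Optional[Tuple[float, float, int]]:
    """Return (low, high, count) for the interval covered by the most clocks,
    or None when that count is not a strict majority."""
    edges = []
    for offset, half_width in clocks:
        edges.append((offset - half_width, -1))  # interval opens
        edges.append((offset + half_width, +1))  # interval closes
    edges.sort()  # at equal positions, opens (-1) sort before closes (+1)

    best = count = 0
    low = high = 0.0
    for i, (position, kind) in enumerate(edges):
        count -= kind  # +1 on an opening edge, -1 on a closing edge
        if count > best:
            best = count
            low = position
            high = edges[i + 1][0]  # the next edge bounds this overlap region
    if best * 2 <= len(clocks):
        return None
    return low, high, best


def truechimers(clocks: List[Clock], low: float, high: float) -> List[int]:
    """Indices of clocks whose own interval overlaps [low, high]."""
    return [i for i, (offset, half_width) in enumerate(clocks)
            if offset - half_width <= high and offset + half_width >= low]


# Three clocks say 1200 (-7200 s), two say 1300 (-3600 s); local clock says 1400.
clocks = [(-7200.0, 0.05)] * 3 + [(-3600.0, 0.05)] * 2
result = intersection_interval(clocks)
if result is not None:
    low, high, count = result
    print(count, "clocks agree; truechimers:", truechimers(clocks, low, high))
```

In this sketch the three 1200 clocks form the majority and the two 1300
clocks fall outside the intersection, whatever the panic threshold is
set to.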

Notwithstanding the above, the important issue is whether to step the 
clock, wait for better times or call for (presumably) human 
intervention. I purposely chose a scenario where it was necessary to 
choose between two alternative cliques, the members of which were close 
together while the cliques themselves were far apart. It could be that 
the nearest clique is within the panic threshold, but the selection 
algorithm considers that clique falseticker by the rules. Thus, the only 
thing remaining is whether to step to the truechimer clique or panic.

In my example there is no credible way the two cliques come together 
if left to themselves, unless one of them is manually stepped. Should 
either clique be rescued, it joins the other clique and things get well. 
So the only question remaining is whether, when the panic threshold is 
exceeded, to fall on your sword or step. I submit this is a nonstarter 
to argue. You get to set the panic interval anywhere you like; perhaps 
30 years might be appropriate.
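
The step-or-panic decision discussed above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the names decide, panic_limit_s and step_limit_s are hypothetical, the
0.128 s step limit is an assumed default, and allow_panic here simply
models a one-time forgiveness of the panic threshold rather than the
daemon's actual bookkeeping.

```python
from typing import Tuple


def decide(offset_s: float,
           panic_limit_s: float,
           allow_panic: bool = False,
           step_limit_s: float = 0.128) -> Tuple[str, bool]:
    """Return (action, allow_panic_afterwards) for an offset to the truechimer clique.

    Hypothetical model: small offsets are slewed, larger ones stepped, and an
    offset beyond the panic limit is stepped only if a one-time pass is left.
    """
    magnitude = abs(offset_s)
    if magnitude <= step_limit_s:
        return "slew", allow_panic
    if magnitude <= panic_limit_s:
        return "step", allow_panic
    if allow_panic:
        return "step", False  # the one-time pass is used up
    return "panic", False


# Local clock says 1400, truechimer clique says 1200: offset is -7200 s.
print(decide(-7200.0, panic_limit_s=3600.0))                    # one-hour limit
print(decide(-7200.0, panic_limit_s=3 * 3600.0))                # limit raised past two hours
print(decide(-7200.0, panic_limit_s=3600.0, allow_panic=True))  # first panic forgiven
```

With the one-hour limit the sketch panics; raising the limit or
forgiving the first panic makes it step to the truechimer clique
instead.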


Brian Utterback wrote:
> Actually, your scenario is a good reason why it may not be a good
> idea to mark clocks that are outside the limit as insane and ignore
> them. If we were to ignore the three that say 1200, then we would only
> have the two that say 1300, right at the limit. So we step the clock to
> 1300. Now we have all five available and since three say 1200, we step
> the clock to 1200, effectively circumventing the panic limit.
> All I am saying is that if you exit there is only one recourse, to
> manually restart. The problem could be permanent, the problem could
> be transient. In either case, somebody needs to log on the system and
> restart the daemon.
> On the other hand, if you instead stop serving time but don't exit,
> then if the problem is transient then no intervention is required. If
> it is permanent but fixable upstream, again no intervention is required.
> If it is permanent and local (I'm thinking somebody set the local clock)
> then it might be fixed by resetting allow_panic (can that be done 
> remotely? With the new config stuff?). And finally, it might still
> require a local login, but that would have happened either way.
> No matter how I slice it, it seems better to me to stay alive and
> hopeful even if those hopes are dashed, than to commit suicide. If
> you stop serving the time downstream, then the effect on the NTP
> network is the same either way, but by staying alive you can allow
> remote diagnosis and keep calling for help periodically.
> David L. Mills wrote:
>> Brian,
>> I am watching five clocks. Three of them say 1200, two say 1300 and my 
>> clock says 1400. Since the majority of clocks I watch say 1200, I 
>> conclude the real time is 1200, but that is beyond my panic limit of 
>> one hour. Should I wait until things "get better"? I think not. I 
>> could make the panic limit over two hours and things would get better 
>> real quick. Or, I could use the -g option, so the first panic would be 
>> forgiven and my clock would read 1200. If after that a warp occurs 
>> over 1000 s relative to the majority clique, there may be a stuck bit 
>> in the hardware clock (that's happened) and I need to jump the train 
>> right away.
>> Dave
>> Brian Utterback wrote:
>>> But is this a valid characterization? And even if it is mostly true, 
>>> what harm is there in waiting to see if it gets better? I think Judah 
>>> has the right idea, namely if the going gets tough, just sit down, shut
>>> up and pretend that you don't exist until things get better. That is,
>>> go ahead and yell, don't step the clock but don't serve time in case
>>> you might be off, but be willing to start up again if things get better
>>> later. This seems like the best of both worlds.
>>> David L. Mills wrote:
>>>> The philosophical basis of this design is very carefully considered 
>>>> in the book. However, the simple characterization of the panic 
>>>> threshold is that if exceeded, it will not get better no matter how 
>>>> long you wait.