[ntp:hackers] What to do when the offset is WAYTOOBIG
brian.utterback at sun.com
Thu Apr 19 05:21:34 PDT 2007
Actually, your scenario is a good reason why it may not be a good
idea to mark clocks that are outside the limit as insane and ignore
them. If we were to ignore the three that say 1200, then we would only
have the two that say 1300, right at the limit. So we step the clock to
1300. Now we have all five available and since three say 1200, we step
the clock to 1200, effectively circumventing the panic limit.
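To make the arithmetic concrete, here is a rough Python sketch of that
sequence. It is not ntpd code: the function names, the whole-minute
clock readings, and the "ignore outliers, then step to the most common
remaining reading" policy are assumptions made up for illustration.

```python
from collections import Counter

PANIC_LIMIT_MIN = 60  # the one-hour panic limit from the scenario


def to_minutes(hhmm):
    """Convert a reading such as 1300 into minutes past midnight."""
    return (hhmm // 100) * 60 + hhmm % 100


def step_ignoring_outliers(local, sources):
    """Hypothetical policy: drop sources beyond the panic limit, then
    step to the most common reading among the ones that remain."""
    usable = [s for s in sources
              if abs(to_minutes(s) - to_minutes(local)) <= PANIC_LIMIT_MIN]
    if not usable:
        return local  # nothing within the limit, leave the clock alone
    return Counter(usable).most_common(1)[0][0]


def simulate(local, sources, max_steps=10):
    """Apply the policy repeatedly and record every clock value seen."""
    history = [local]
    for _ in range(max_steps):
        stepped = step_ignoring_outliers(local, sources)
        if stepped == local:
            break
        local = stepped
        history.append(local)
    return history


print(simulate(1400, [1200, 1200, 1200, 1300, 1300]))
# [1400, 1300, 1200]
```

Each individual step stays within the one-hour limit, yet the clock
ends up two hours from where it started.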
All I am saying is that if you exit there is only one recourse, to
manually restart. The problem could be permanent, the problem could
be transient. In either case, somebody needs to log on to the system and
restart the daemon.
On the other hand, if you instead stop serving time but don't exit,
then if the problem is transient then no intervention is required. If
it is permanent but fixable upstream, again no intervention is required.
If it is permanent and local (I'm thinking somebody set the local clock)
then it might be fixed by resetting allow_panic (can that be done
remotely? With the new config stuff?). And finally, it might still
require a local login, but that would have happened either way.
No matter how I slice it, it seems better to me to stay alive and
hopeful even if those hopes are dashed, than to commit suicide. If
you stop serving the time downstream, then the effect on the NTP
network is the same either way, but by staying alive you can allow
remote diagnosis and keep calling for help periodically.
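A minimal sketch of the behavior I am arguing for, in Python. This is
not how ntpd is written. The State names and the next_state() and
run() helpers are hypothetical, and the allow_panic argument only
borrows the name of the real variable. The 1000 s figure is the default
panic threshold.

```python
from enum import Enum


class State(Enum):
    SERVING = "serving"  # clock is trusted, time is served downstream
    SILENT = "silent"    # alive and logging, but not serving or stepping


PANIC_LIMIT_S = 1000.0  # default panic threshold, in seconds


def next_state(offset_s, allow_panic=False):
    """Pick the state for one measured offset instead of exiting."""
    if abs(offset_s) <= PANIC_LIMIT_S or allow_panic:
        return State.SERVING
    return State.SILENT


def run(offsets):
    """Walk through a series of measured offsets, one per poll."""
    states = []
    for offset in offsets:
        state = next_state(offset)
        if state is State.SILENT:
            # Keep calling for help, but stay up for remote diagnosis.
            print(f"offset {offset:.1f} s exceeds panic limit; not serving")
        states.append(state.value)
    return states


# A transient two-hour warp that clears up without anyone logging in.
print(run([0.5, 7200.0, 7200.0, 0.3]))
# ['serving', 'silent', 'silent', 'serving']
```

With the exit policy, the same sequence ends at the second poll and
stays dead until somebody restarts the daemon by hand.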
David L. Mills wrote:
> I am watching five clocks. Three of them say 1200, two say 1300 and my
> clock says 1400. Since the majority of clocks I watch say 1200, I
> conclude the real time is 1200, but that is beyond my panic limit of one
> hour. Should I wait until things "get better"? I think not. I could make
> the panic limit over two hours and things would get better real quick.
> Or, I could use the -g option, so the first panic would be forgiven and
> my clock would read 1200. If after that a warp occurs over 1000 s
> relative to the majority clique, there may be a stuck bit in the
> hardware clock (that's happened) and I need to jump the train right away.
> Brian Utterback wrote:
>> But is this a valid characterization? And even if it is mostly true,
>> what harm is there in waiting to see if it gets better? I think Judah
>> has the right idea, namely if the going gets tough, just sit down, shut
>> up and pretend that you don't exist until things get better. That is,
>> go ahead and yell, don't step the clock but don't serve time in case
>> you might be off, but be willing to start up again if things get better
>> later. This seems like the best of both worlds.
>> David L. Mills wrote:
>>> The philosophical basis of this design is very carefully considered
>>> in the book. However, the simple characterization of the panic
>>> threshold is that if exceeded, it will not get better no matter how
>>> long you wait.
"Remember 'A Thousand Points of Light'? With a network, we now have
a thousand points of failure."
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.