[ntp:hackers] What to do when the offset is WAYTOOBIG

David L. Mills mills at udel.edu
Thu Apr 19 16:23:09 PDT 2007


Brian,

A message to the system log is sent if the panic threshold is exceeded. 
It would be a simple matter to continue operation without setting the 
system clock under the assumption that the system operator, locally or 
remotely, set the clock manually. But, we need a way for the syslog 
message (or the related trap) to alert the operator about this 
condition. Judah has programmed this function to call his beeper, which 
seems to be an appropriate response. Better yet, call your Blackberry 
and fix it from there.
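
As a rough illustration of the decision being discussed, here is a minimal Python sketch, not ntpd code. The 0.128 s step and 1000 s panic values are ntpd's documented defaults; the function names, the "hold" action and the exit_on_panic switch are hypothetical and only model the behavior proposed in this thread.

```python
import logging

STEP_THRESHOLD = 0.128    # seconds; ntpd's default step threshold
PANIC_THRESHOLD = 1000.0  # seconds; ntpd's default panic threshold

log = logging.getLogger("panic-sketch")  # hypothetical logger name


def classify_offset(offset, step=STEP_THRESHOLD, panic=PANIC_THRESHOLD):
    """Return 'slew', 'step' or 'panic' for a measured offset in seconds."""
    magnitude = abs(offset)
    if magnitude > panic:
        return "panic"
    if magnitude > step:
        return "step"
    return "slew"


def handle_offset(offset, exit_on_panic=True):
    """Return the action taken for this offset.

    With exit_on_panic=False (an assumed option, not an ntpd flag) the
    daemon logs the condition and keeps running without setting the clock.
    """
    action = classify_offset(offset)
    if action != "panic":
        return action
    log.error("offset %.3f s exceeds panic threshold %.0f s",
              offset, PANIC_THRESHOLD)
    return "exit" if exit_on_panic else "hold"
```

For example, a two-hour offset (7200 s) is classified as "panic"; with exit_on_panic=False the sketch logs the error and returns "hold" rather than "exit".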

Most often, the operator is in an office down the corridor and can ssh 
to the victim machine and fix the problem. A really neat way to do this 
is using the remote configuration features of the now dead in the water 
configuration rewrite code. This could be used both to monitor operation 
in real time and to send authenticated configuration commands in 
real time.

While not volunteering to do this myself, I propose a new option that, 
if present, avoids killing the process and instead calls a designated 
program. Also, a trap should be provided to alert the ntpq monitoring 
program which itself can call a designated program if the performance is 
out of bounds.
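
A sketch of what "calls a designated program" might look like, in Python rather than in ntpd's C. Everything here is an assumption for illustration: alert_operator, its arguments and the convention of passing the offset as the last command-line argument are hypothetical and are not an existing ntpd option.

```python
import subprocess
import sys


def alert_operator(command, offset, timeout=10.0):
    """Run the designated program with the offset appended as an argument.

    Returns the program's exit status, or None if it could not be started
    or did not finish within the timeout. The caller keeps running in
    either case instead of exiting.
    """
    argv = list(command) + [f"{offset:.3f}"]
    try:
        result = subprocess.run(argv, timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.returncode


if __name__ == "__main__":
    # Stand-in for a pager or mail script: a tiny Python one-liner.
    hook = [sys.executable, "-c",
            "import sys; print('offset too big:', sys.argv[1])"]
    print(alert_operator(hook, 7200.0))
```

A failure of the designated program is reported to the caller rather than raised, so a broken pager script does not take the time daemon down with it.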

All this of course should eventually be done using SNMP infrastructure; 
however, upon closer inspection, SNMP is woefully inadequate with the 
data types and trip wires commonly used by ntpd.

Dave

Brian Utterback wrote:

> I agree with everything you said. As I said in my previous message,
> you have convinced me that trying to monkey with the clock selection
> is the wrong way to go, despite the fact that it has some nice
> properties. So I agree that the only thing left is the decision to
> step or fall on your sword. And as you note, the step threshold is
> configurable, so there is nothing to discuss if you want to step.
> However, I think that there is one more item on the agenda, namely
> whether or not it would be a good idea to fall on your sword or
> merely run and hide and call for help.
>
> So, my proposal is that instead of exiting with an error message,
> we do not step the clock, we do print an error message, and we
> mark the clock as insane (or otherwise stop sending out the time).
> We should ensure that there is a way to remotely set the allow_panic
> variable if the admin decides that is the way to fix things. I am
> happy to make the choice about whether or not to exit configurable
> as well.
>
> So, Dave, I guess the question I have is: is there a real technical reason
> why this is a bad idea? Is there a scenario where the behavior I
> am proposing behaves worse than the current behavior? We know that
> there are some common scenarios where it has more desirable behavior.
>
> David L. Mills wrote:
>
>> Brian,
>>
>> There are two separable issues to consider here, the clock selection 
>> algorithm and the panic threshold.
>>
>> The clock selection algorithm was designed only after considerable 
>> discussion, both in the commercial community (DEC) and in the computer 
>> science theory community. As has been noted, it is based on the 
>> model described in Keith Marzullo's dissertation and discussions at 
>> the Dagstuhl Conference in southern Germany. I say this to emphasize 
>> this model has been extensively vetted by a bunch of guys I trust.
>>
>> The absolute bedrock method in the design is to find the best 
>> majority subset of clocks that agree within some interval based on 
>> delay. There is strong theoretical and practical evidence that the 
>> true UTC is somewhere in the middle of what I call the intersection 
>> interval developed by the algorithm. In my example, the conditions 
>> are the same before the clock is set at 1200 and after the clock has 
>> been set manually. The three 1200 clocks remain the truechimers and 
>> the two 1300 clocks remain the falsetickers. The fact the 
>> falsetickers are above or below the panic threshold is not 
>> significant. The only time the panic threshold is important is at the 
>> time the clock is to be set.
>>
>> Notwithstanding the above, the important issue is whether to step the 
>> clock, wait for better times or call for (presumably) human 
>> intervention. I purposely chose a scenario where it was necessary to 
>> choose between two alternative cliques, the members of which were 
>> close together while the cliques themselves were far apart. It could 
>> be the nearest clique is within the panic threshold, but the 
>> selection algorithm considers that clique falseticker by the rules. 
>> Thus, the only thing remaining is whether to step to the truechimer 
>> clique or panic.
>>
>> In my example there is no credible way the two cliques come together 
>> left by themselves unless one of them is manually stepped. Should 
>> either clique be rescued, it joins the other clique and things get 
>> well. So, the only question remaining is when the panic threshold is 
>> exceeded, whether to fall on your sword or step. I submit this is a 
>> nonstarter to argue. You get to set the panic interval anywhere you 
>> like, perhaps 30 years might be appropriate.
>>
>> Dave
>>
>> Brian Utterback wrote:
>>
>>> Actually, your scenario is a good reason why it may not be a good
>>> idea to mark clocks that are outside the limit as insane and ignore
>>> them. If we were to ignore the three that say 1200, then we would only
>>> have the two that say 1300, right at the limit. So we step the clock to
>>> 1300. Now we have all five available and since three say 1200, we step
>>> the clock to 1200, effectively circumventing the panic limit.
>>>
>>> All I am saying is that if you exit there is only one recourse, to
>>> manually restart. The problem could be permanent, the problem could
>>> be transient. In either case, somebody needs to log on the system and
>>> restart the daemon.
>>>
>>> On the other hand, if you instead stop serving time but don't exit,
>>> then if the problem is transient then no intervention is required. If
>>> it is permanent but fixable upstream, again no intervention is required.
>>> If it is permanent and local (I'm thinking somebody set the local clock)
>>> then it might be fixed by resetting allow_panic (can that be done
>>> remotely? With the new config stuff?). And finally, it might still
>>> require a local login, but that would have happened either way.
>>>
>>> No matter how I slice it, it seems better to me to stay alive and
>>> hopeful even if those hopes are dashed, than to commit suicide. If
>>> you stop serving the time downstream, then the effect on the NTP
>>> network is the same either way, but by staying alive you can allow
>>> remote diagnosis and keep calling for help periodically.
>>>
>>> David L. Mills wrote:
>>>
>>>> Brian,
>>>>
>>>> I am watching five clocks. Three of them say 1200, two say 1300 and 
>>>> my clock says 1400. Since the majority of clocks I watch say 1200, 
>>>> I conclude the real time is 1200, but that is beyond my panic limit 
>>>> of one hour. Should I wait until things "get better"? I think not. 
>>>> I could make the panic limit over two hours and things would get 
>>>> better real quick. Or, I could use the -g option, so the first 
>>>> panic would be forgiven and my clock would read 1200. If after that 
>>>> a warp occurs over 1000 s relative to the majority clique, there 
>>>> may be a stuck bit in the hardware clock (that's happened) and I 
>>>> need to jump the train right away.
>>>>
>>>> Dave
>>>>
>>>> Brian Utterback wrote:
>>>>
>>>>> But is this a valid characterization? And even if it is mostly
>>>>> true, what harm is there in waiting to see if it gets better? I
>>>>> think Judah has the right idea, namely if the going gets tough,
>>>>> just sit down, shut up and pretend that you don't exist until
>>>>> things get better. That is, go ahead and yell, don't step the
>>>>> clock but don't serve time in case you might be off, but be
>>>>> willing to start up again if things get better later. This seems
>>>>> like the best of both worlds.
>>>>>
>>>>> David L. Mills wrote:
>>>>>
>>>>>>
>>>>>> The philosophical basis of this design is very carefully 
>>>>>> considered in the book. However, the simple characterization of 
>>>>>> the panic threshold is that if exceeded, it will not get better 
>>>>>> no matter how long you wait.


