[ntp:hackers] What to do when the offset is WAYTOOBIG

Brian Utterback brian.utterback at sun.com
Fri Apr 20 11:46:47 PDT 2007


This issue is a show stopper for me regarding the integration of
NTPv4 into Solaris Nevada. I cannot guarantee to make progress since
it wouldn't be part of my day job, but if you can point me in the
right direction, I could take a stab at it. Where is the new code?
And, Harlan, the new command line code requires autogen to modify
options, am I right? I think the comments in the README files are
behind in the versions of autogen, autoconf, automake. What versions
are required now?

David L. Mills wrote:
> Brian,
> 
> So far as I know, nobody has volunteered to integrate the rewrite with 
> the current working code. It was all set to go when Harlan changed the 
> command line option code, which was not compatible with the rewrite. 
> This is not to say Harlan's change was good or bad, just that the 
> programmer assigned to the rewrite was assigned a new job before he 
> figured it out.
> 
> Dave
> 
> Brian Utterback wrote:
> 
>> What you describe is exactly what I am looking for. All too often the
>> program exits with only that single message in the syslog and is 
>> overlooked, and worse, not even available remotely.
>>
>> I am concerned by your use of the phrase "dead in the water 
>> configuration rewrite code". During recent discussions, I was given
>> to understand that the config rewrite code was likely to be integrated
>> fairly soon. Is that no longer the case?
>>
>> David L. Mills wrote:
>>
>>> Most often, the operator is in an office down the corridor and can ssh
>>> to the victim machine and fix the problem. A really neat way to do
>>> this is using the remote configuration features of the now dead in the
>>> water configuration rewrite code. This could be used both to monitor
>>> operation in real time and to send authenticated configuration
>>> commands in real time.
>>>
>>> While not volunteering to do this myself, I propose a new option that, 
>>> if present, avoids killing the process and instead calls a designated 
>>> program. Also, a trap should be provided to alert the ntpq monitoring 
>>> program which itself can call a designated program if the performance 
>>> is out of bounds.
>>>
>>> All this of course should eventually be done using SNMP 
>>> infrastructure; however, upon closer inspection, SNMP is woefully 
>>> inadequate with the data types and trip wires commonly used by ntpd.
>>>
>>> Dave
>>>
>>> Brian Utterback wrote:
>>>
>>>> I agree with everything you said. As I said in my previous message,
>>>> you have convinced me that trying to monkey with the clock selection
>>>> is the wrong way to go, despite the fact that it has some nice
>>>> properties. So I agree that the only thing left is the decision to
>>>> step or fall on your sword. And as you note, the step threshold is
>>>> configurable, so there is nothing to discuss if you want to step.
>>>> However, I think that there is one more item on the agenda, namely
>>>> whether or not it would be a good idea to fall on your sword or
>>>> merely run and hide and call for help.
>>>>
>>>> So, my proposal is that instead of exiting with an error message,
>>>> we do not step the clock, we do print an error message, and we
>>>> mark the clock as insane (or otherwise stop sending out the time).
>>>> We should ensure that there is a way to remotely set the allow_panic
>>>> variable if the admin decides that is the way to fix things. I am
>>>> happy to make the choice about whether or not to exit configurable
>>>> as well.
>>>>
>>>> So, Dave, I guess the question I have is: is there a real technical reason
>>>> why this is a bad idea? Is there a scenario where the behavior I
>>>> am proposing behaves worse than the current behavior? We know that
>>>> there are some common scenarios where it has more desirable behavior.
>>>>
>>>> David L. Mills wrote:
>>>>
>>>>> Brian,
>>>>>
>>>>> There are two separable issues to consider here, the clock selection
>>>>> algorithm and the panic threshold.
>>>>>
>>>>> The clock selection algorithm was designed only after considerable
>>>>> discussion, both in the commercial community (DEC) and in the
>>>>> computer science theory community. As has been noted, it is based
>>>>> on the model described in Keith Marzullo's dissertation and
>>>>> discussions at the Dagstuhl Conference in southern Germany. I say
>>>>> this to emphasize this model has been extensively vetted by a bunch 
>>>>> of guys I trust.
>>>>>
>>>>> The absolute bedrock method in the design is to find the best 
>>>>> majority subset of clocks that agree within some interval based on 
>>>>> delay. There is strong theoretical and practical evidence that the 
>>>>> true UTC is somewhere in the middle of what I call the intersection 
>>>>> interval developed by the algorithm. In my example, the conditions 
>>>>> are the same before the clock is set at 1200 and after the clock has 
>>>>> been set manually. The three 1200 clocks remain the truechimers and 
>>>>> the two 1300 clocks remain the falsetickers. The fact the 
>>>>> falsetickers are above or below the panic threshold is not
>>>>> significant. The only time the panic threshold is important is at 
>>>>> the time the clock is to be set.
>>>>>
>>>>> Notwithstanding the above, the important issue is whether to step 
>>>>> the clock, wait for better times or call for (presumably) human 
>>>>> intervention. I purposely chose a scenario where it was necessary to 
>>>>> choose between two alternative cliques, the members of which were
>>>>> close together while the cliques themselves were far apart. It could 
>>>>> be the nearest clique is within the panic threshold, but the 
>>>>> selection algorithm considers that clique falseticker by the rules. 
>>>>> Thus, the only thing remaining is whether to step to the truechimer
>>>>> clique or panic.
>>>>>
>>>>> In my example there is no credible way the two cliques come together 
>>>>> left by themselves unless one of them is manually stepped. Should 
>>>>> either clique be rescued, it joins the other clique and things get 
>>>>> well. So, the only question remaining is when the panic threshold is 
>>>>> exceeded, whether to fall on your sword or step. I submit this is a 
>>>>> nonstarter to argue. You get to set the panic interval anywhere you 
>>>>> like, perhaps 30 years might be appropriate.
>>>>>
>>>>> Dave
>>>>>
>>>>> Brian Utterback wrote:
>>>>>
>>>>>> Actually, your scenario is a good reason why it may not be a good
>>>>>> idea to mark clocks that are outside the limit as insane and ignore
>>>>>> them. If we were to ignore the three that say 1200, then we would only
>>>>>> have the two that say 1300, right at the limit. So we step the clock
>>>>>> to 1300. Now we have all five available and since three say 1200, we
>>>>>> step the clock to 1200, effectively circumventing the panic limit.
>>>>>>
>>>>>> All I am saying is that if you exit there is only one recourse, to
>>>>>> manually restart. The problem could be permanent, the problem could
>>>>>> be transient. In either case, somebody needs to log on the system and
>>>>>> restart the daemon.
>>>>>>
>>>>>> On the other hand, if you instead stop serving time but don't exit,
>>>>>> then if the problem is transient then no intervention is required. If
>>>>>> it is permanent but fixable upstream, again no intervention is
>>>>>> required. If it is permanent and local (I'm thinking somebody set the
>>>>>> local clock) then it might be fixed by resetting allow_panic (can that
>>>>>> be done remotely? With the new config stuff?). And finally, it might
>>>>>> still require a local login, but that would have happened either way.
>>>>>>
>>>>>> No matter how I slice it, it seems better to me to stay alive and
>>>>>> hopeful even if those hopes are dashed, than to commit suicide. If
>>>>>> you stop serving the time downstream, then the effect on the NTP
>>>>>> network is the same either way, but by staying alive you can allow
>>>>>> remote diagnosis and keep calling for help periodically.
>>>>>>
>>>>>> David L. Mills wrote:
>>>>>>
>>>>>>> Brian,
>>>>>>>
>>>>>>> I am watching five clocks. Three of them say 1200, two say 1300 and
>>>>>>> my clock says 1400. Since the majority of clocks I watch say 1200, 
>>>>>>> I conclude the real time is 1200, but that is beyond my panic
>>>>>>> limit of one hour. Should I wait until things "get better"? I 
>>>>>>> think not. I could make the panic limit over two hours and things 
>>>>>>> would get better real quick. Or, I could use the -x option, so the
>>>>>>> first panic would be forgiven and my clock would read 1200. If 
>>>>>>> after that a warp occurs over 1000 s relative to the majority 
>>>>>>> clique, there may be a stuck bit in the hardware clock (that's 
>>>>>>> happened) and I need to jump the train right away.
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>> Brian Utterback wrote:
>>>>>>>
>>>>>>>> But is this a valid characterization? And even if it is mostly
>>>>>>>> true, what harm is there in waiting to see if it gets better? I
>>>>>>>> think Judah has the right idea, namely if the going gets tough,
>>>>>>>> just sit down, shut up and pretend that you don't exist until
>>>>>>>> things get better. That is, go ahead and yell, don't step the clock
>>>>>>>> but don't serve time in case you might be off, but be willing to
>>>>>>>> start up again if things get better later. This seems like the best
>>>>>>>> of both worlds.
>>>>>>>>
>>>>>>>> David L. Mills wrote:
>>>>>>>>
>>>>>>>>> The philosophical basis of this design is very carefully 
>>>>>>>>> considered in the book. However, the simple characterization of 
>>>>>>>>> the panic threshold is that if exceeded, it will not get better 
>>>>>>>>> no matter how long you wait.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
> _______________________________________________
> hackers mailing list
> hackers at support.ntp.org
> https://support.ntp.org/mailman/listinfo/hackers
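
For concreteness, the choice debated in the quoted thread (slew, step,
exit, or stay alive without serving time) might be sketched roughly as
below. This is only an illustration, not ntpd code. It assumes ntpd's
documented defaults of 0.128 s for the step threshold and 1000 s for
the panic threshold. The function decide(), the Action enum and the
exit_on_panic flag are hypothetical names invented for this sketch;
allow_panic mirrors the variable mentioned in the thread.

```python
from enum import Enum

# Assumed defaults (ntpd's documented step and panic thresholds).
STEP_THRESHOLD_S = 0.128
PANIC_THRESHOLD_S = 1000.0


class Action(Enum):
    SLEW = "slew"   # small offset: adjust the clock gradually
    STEP = "step"   # large offset: set the clock in one jump
    EXIT = "exit"   # current behavior past the panic threshold
    HOLD = "hold"   # proposed: stay alive, log, stop serving time


def decide(offset_s, allow_panic=False, exit_on_panic=True):
    """Pick an action for a measured offset in seconds.

    exit_on_panic is a hypothetical switch for the proposal in this
    thread: when False, the daemon would hold instead of exiting.
    """
    magnitude = abs(offset_s)
    if magnitude > PANIC_THRESHOLD_S and not allow_panic:
        return Action.EXIT if exit_on_panic else Action.HOLD
    if magnitude > STEP_THRESHOLD_S:
        return Action.STEP
    return Action.SLEW


if __name__ == "__main__":
    # Local clock reads 1400 while the majority clique says 1200,
    # so the offset is minus two hours.
    offset = -2 * 3600.0
    print(decide(offset))                       # current behavior
    print(decide(offset, exit_on_panic=False))  # proposed behavior
    print(decide(offset, allow_panic=True))     # panic check waived
```

In this sketch, HOLD leaves the process running so that it can still be
queried remotely and can re-evaluate the offset on a later poll, which
is the property argued for in the quoted messages.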

-- 
blu

"Remember 'A Thousand Points of Light'? With a network, we now have
a thousand points of failure."
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom

