[ntp:hackers] What to do when the offset is WAYTOOBIG

David L. Mills mills at udel.edu
Fri Apr 20 14:03:32 PDT 2007


Brian and Harlan,

The alleged configuration rewrite tree is on pogo.udel.edu under user
name kamboj, new-config-bk-repo. The file modification dates reveal
somebody working on it recently, like 20 April. Could be Sachin.

Dave

Brian Utterback wrote:
> This issue is a show stopper for me regarding the integration of
> NTPv4 into Solaris Nevada. I cannot guarantee to make progress since
> it wouldn't be part of my day job, but if you can point me in the
> right direction, I could take a stab at it. Where is the new code?
> And, Harlan, the new command line code requires autogen to modify
> options, am I right? I think the comments in the README files are
> behind in the versions of autogen, autoconf, automake. What versions
> are required now?
> 
> David L. Mills wrote:
> 
>> Brian,
>>
>> So far as I know, nobody has volunteered to integrate the rewrite with 
>> the current working code. It was all set to go when Harlan changed the 
>> command line option code, which was not compatible with the rewrite. 
>> This is not to say Harlan's change was good or bad, just that the 
>> programmer assigned to the rewrite was assigned a new job before he 
>> figured it out.
>>
>> Dave
>>
>> Brian Utterback wrote:
>>
>>> What you describe is exactly what I am looking for. All too often the
>>> program exits with only that single message in the syslog and is 
>>> overlooked, and worse, not even available remotely.
>>>
>>> I am concerned by your use of the phrase "dead in the water 
>>> configuration rewrite code". During recent discussions, I was given
>>> to understand that the config rewrite code was likely to be integrated
>>> fairly soon. Is that no longer the case?
>>>
>>> David L. Mills wrote:
>>>
>>>> Most often, the operator is in an office down the corridor and can
>>>> ssh to the victim machine and fix the problem. A really neat way to
>>>> do this is using the remote configuration features of the now dead
>>>> in the water configuration rewrite code. This could be used both to
>>>> monitor operation and to send authenticated configuration commands
>>>> in real time.
>>>>
>>>> While not volunteering to do this myself, I propose a new option 
>>>> that, if present, avoids killing the process and instead calls a 
>>>> designated program. Also, a trap should be provided to alert the 
>>>> ntpq monitoring program which itself can call a designated program 
>>>> if the performance is out of bounds.
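
The "call a designated program instead of exiting" option proposed above could look roughly like the following. This is only a sketch: `on_panic` and its `handler` argument are hypothetical names, not an existing ntpd option, and passing the offset as a command-line argument is an assumption.

```python
import subprocess
import sys

def on_panic(offset_s, handler=None):
    """Handle an offset beyond the panic threshold.

    handler is a hypothetical argv list naming the operator's designated
    program; None models the exit-on-panic behavior described in the thread.
    """
    if handler is None:
        # Behavior as described in this thread: log one message and exit.
        print(f"offset {offset_s:+.0f} s exceeds panic threshold, exiting",
              file=sys.stderr)
        raise SystemExit(1)
    # Proposed behavior: keep running and hand the offset to the program.
    result = subprocess.run(handler + [f"{offset_s:.3f}"], check=False)
    return result.returncode
```

A caller would pass something like `on_panic(7200.0, ["/path/to/notify-operator"])`, where the path is a placeholder for whatever program the site designates.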
>>>>
>>>> All this of course should eventually be done using SNMP 
>>>> infrastructure; however, upon closer inspection, SNMP is woefully 
>>>> inadequate with the data types and trip wires commonly used by ntpd.
>>>>
>>>> Dave
>>>>
>>>> Brian Utterback wrote:
>>>>
>>>>> I agree with everything you said. As I said in my previous message,
>>>>> you have convinced me that trying to monkey with the clock selection
>>>>> is the wrong way to go, despite the fact that it has some nice
>>>>> properties. So I agree that the only thing left is the decision to
>>>>> step or fall on your sword. And as you note, the step threshold is
>>>>> configurable, so there is nothing to discuss if you want to step.
>>>>> However, I think that there is one more item on the agenda, namely
>>>>> whether or not it would be a good idea to fall on your sword or
>>>>> merely run and hide and call for help.
>>>>>
>>>>> So, my proposal is that instead of exiting with an error message,
>>>>> we do not step the clock, we do print an error message, and we
>>>>> mark the clock as insane (or otherwise stop sending out the time).
>>>>> We should ensure that there is a way to remotely set the allow_panic
>>>>> variable if the admin decides that is the way to fix things. I am
>>>>> happy to make the choice about whether or not to exit configurable
>>>>> as well.
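
A minimal model of this stay-alive proposal, for illustration only. `PanicPolicy` and its attributes are hypothetical and are not ntpd code; the 1000 s default panic threshold and the one-shot treatment of `allow_panic` are assumptions about how such a policy might behave. Leap indicator value 3 is the NTP "clock not synchronized" alarm.

```python
import logging

LEAP_NOTINSYNC = 3  # NTP leap indicator value meaning "clock not synchronized"

class PanicPolicy:
    """Hypothetical model of the stay-alive proposal; not ntpd code."""

    def __init__(self, panic_s=1000.0, exit_on_panic=False):
        self.panic_s = panic_s
        self.exit_on_panic = exit_on_panic
        self.allow_panic = False  # could be set remotely by an admin
        self.leap = 0             # 0 = serving time normally

    def update(self, offset_s):
        """Return the action taken for one measured offset (seconds)."""
        if abs(offset_s) <= self.panic_s:
            self.leap = 0
            return "discipline"
        if self.allow_panic:
            self.allow_panic = False  # forgive one large offset only
            self.leap = 0
            return "step"
        if self.exit_on_panic:
            raise SystemExit(1)
        logging.error("offset %.0f s exceeds panic threshold", offset_s)
        self.leap = LEAP_NOTINSYNC  # stop serving time, but stay alive
        return "hold"
```

With `exit_on_panic=False` the daemon holds, advertises itself as unsynchronized, and resumes once the offset falls back inside the threshold or an admin sets `allow_panic`.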
>>>>>
>>>>> So, Dave, I guess the question I have is: is there a real technical
>>>>> reason why this is a bad idea? Is there a scenario where the behavior I
>>>>> am proposing behaves worse than the current behavior? We know that
>>>>> there are some common scenarios where it has more desirable behavior.
>>>>>
>>>>> David L. Mills wrote:
>>>>>
>>>>>> Brian,
>>>>>>
>>>>>> There are two separable issues to consider here, the clock
>>>>>> selection algorithm and the panic threshold.
>>>>>>
>>>>>> The clock selection algorithm was designed only after considerable 
>>>>>> discussion, both in the commercial community (DEC) and in the
>>>>>> computer science theory community. As has been noted, it is based
>>>>>> on the model described in Keith Marzullo's dissertation and
>>>>>> discussions at the Dagstuhl Conference in southern Germany. I say 
>>>>>> this to emphasize this model has been extensively vetted by a 
>>>>>> bunch of guys I trust.
>>>>>>
>>>>>> The absolute bedrock method in the design is to find the best 
>>>>>> majority subset of clocks that agree within some interval based on 
>>>>>> delay. There is strong theoretical and practical evidence that the 
>>>>>> true UTC is somewhere in the middle of what I call the 
>>>>>> intersection interval developed by the algorithm. In my example, 
>>>>>> the conditions are the same before the clock is set at 1200 and 
>>>>>> after the clock has been set manually. The three 1200 clocks 
>>>>>> remain the truechimers and the two 1300 clocks remain the 
>>>>>> falsetickers. The fact the falsetickers are above or below the
>>>>>> panic threshold is not significant. The only time the panic 
>>>>>> threshold is important is at the time the clock is to be set.
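
The majority-intersection idea can be illustrated with the plain form of Marzullo's algorithm. This is a simplified sketch, not ntpd's actual selection code, which refines the method. Expressing the clocks as minutes past midnight and giving each a one-minute half-width are assumptions made for the five-clock example from this thread.

```python
def marzullo(intervals):
    """Return (count, lo, hi): the interval covered by the most sources."""
    edges = []
    for lo, hi in intervals:
        edges.append((lo, -1))  # interval opens
        edges.append((hi, +1))  # interval closes
    edges.sort()  # at equal offsets, opens (-1) sort before closes (+1)
    best = count = 0
    best_lo = best_hi = None
    for i, (offset, kind) in enumerate(edges):
        count -= kind  # an open raises the overlap count, a close lowers it
        if count > best:
            best = count
            best_lo = offset
            best_hi = edges[i + 1][0]
    return best, best_lo, best_hi

# Three clocks near 1200 (720 min) and two near 1300 (780 min).
clocks = [(719, 721)] * 3 + [(779, 781)] * 2
count, lo, hi = marzullo(clocks)
is_majority = count > len(clocks) / 2
truechimers = [c for c in clocks if c[0] <= hi and c[1] >= lo]
print(count, lo, hi, is_majority, len(truechimers))  # 3 719 721 True 3
```

Note that the local clock's own reading never enters the computation, which matches the point that the result is the same before and after the local clock is set.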
>>>>>>
>>>>>> Notwithstanding the above, the important issue is whether to step 
>>>>>> the clock, wait for better times or call for (presumably) human 
>>>>>> intervention. I purposely chose a scenario where it was necessary 
>>>>>> to choose between two alternative cliques, the members of which
>>>>>> were close together while the cliques themselves were far apart. 
>>>>>> It could be the nearest clique is within the panic threshold, but 
>>>>>> the selection algorithm considers that clique falseticker by the 
>>>>>> rules. Thus, the only thing remaining is whether to step to the
>>>>>> truechimer clique or panic.
>>>>>>
>>>>>> In my example there is no credible way the two cliques come 
>>>>>> together left by themselves unless one of them is manually 
>>>>>> stepped. Should either clique be rescued, it joins the other 
>>>>>> clique and things get well. So, the only question remaining is 
>>>>>> when the panic threshold is exceeded, whether to fall on your 
>>>>>> sword or step. I submit this is a nonstarter to argue. You get to 
>>>>>> set the panic interval anywhere you like, perhaps 30 years might 
>>>>>> be appropriate.
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>> Brian Utterback wrote:
>>>>>>
>>>>>>> Actually, your scenario is a good reason why it may not be a good
>>>>>>> idea to mark clocks that are outside the limit as insane and ignore
>>>>>>> them. If we were to ignore the three that say 1200, then we would
>>>>>>> only have the two that say 1300, right at the limit. So we step the
>>>>>>> clock to 1300. Now we have all five available and since three say
>>>>>>> 1200, we step the clock to 1200, effectively circumventing the panic
>>>>>>> limit.
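
The circumvention can be traced with a toy simulation of the five-clock example (three sources at 1200, two at 1300, local clock at 1400, panic limit of one hour). The policy function is hypothetical: it models "ignore sources beyond the limit, then follow the majority of the rest" and is not ntpd behavior. Times are minutes past midnight.

```python
from collections import Counter

def step_ignoring_far_clocks(local, sources, panic_limit):
    """Step to the most common reading among sources within panic_limit."""
    near = [s for s in sources if abs(s - local) <= panic_limit]
    if not near:
        return local  # nothing usable; leave the clock alone
    value, _ = Counter(near).most_common(1)[0]
    return value

sources = [720, 720, 720, 780, 780]  # 1200 x3, 1300 x2
local = 840                          # 1400
trajectory = [local]
for _ in range(2):
    local = step_ignoring_far_clocks(local, sources, panic_limit=60)
    trajectory.append(local)
print(trajectory)  # [840, 780, 720]
```

Two steps of one hour each move the clock by two hours in total, even though no single step exceeded the limit.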
>>>>>>>
>>>>>>> All I am saying is that if you exit there is only one recourse, to
>>>>>>> manually restart. The problem could be permanent, the problem could
>>>>>>> be transient. In either case, somebody needs to log on the system
>>>>>>> and restart the daemon.
>>>>>>>
>>>>>>> On the other hand, if you instead stop serving time but don't exit,
>>>>>>> then if the problem is transient then no intervention is required.
>>>>>>> If it is permanent but fixable upstream, again no intervention is
>>>>>>> required. If it is permanent and local (I'm thinking somebody set
>>>>>>> the local clock) then it might be fixed by resetting allow_panic
>>>>>>> (can that be done remotely? With the new config stuff?). And
>>>>>>> finally, it might still require a local login, but that would have
>>>>>>> happened either way.
>>>>>>>
>>>>>>> No matter how I slice it, it seems better to me to stay alive and
>>>>>>> hopeful even if those hopes are dashed, than to commit suicide. If
>>>>>>> you stop serving the time downstream, then the effect on the NTP
>>>>>>> network is the same either way, but by staying alive you can allow
>>>>>>> remote diagnosis and keep calling for help periodically.
>>>>>>>
>>>>>>> David L. Mills wrote:
>>>>>>>
>>>>>>>> Brian,
>>>>>>>>
>>>>>>>> I am watching five clocks. Three of them say 1200, two say 1300
>>>>>>>> and my clock says 1400. Since the majority of clocks I watch say
>>>>>>>> 1200, I conclude the real time is 1200, but that is beyond my
>>>>>>>> panic limit of one hour. Should I wait until things "get
>>>>>>>> better"? I think not. I could make the panic limit over two
>>>>>>>> hours and things would get better real quick. Or, I could use
>>>>>>>> the -x option, so the first panic would be forgiven and my clock
>>>>>>>> would read 1200. If after that a warp occurs over 1000 s
>>>>>>>> relative to the majority clique, there may be a stuck bit in the
>>>>>>>> hardware clock (that's happened) and I need to jump the train
>>>>>>>> right away.
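
The arithmetic in this example can be written as a small decision function. This is a sketch, not ntpd's code: `decide` is a hypothetical name, the 0.128 s and 1000 s defaults are assumed to match ntpd's usual step and panic thresholds, and `allow_panic` models the one-time forgiveness of a large offset.

```python
def decide(offset_s, step_s=0.128, panic_s=1000.0, allow_panic=False):
    """Classify an offset (seconds) as 'slew', 'step' or 'panic'."""
    magnitude = abs(offset_s)
    if magnitude <= step_s:
        return "slew"
    if magnitude <= panic_s or allow_panic:
        return "step"
    return "panic"

HOUR = 3600.0
offset = (12 - 14) * HOUR  # majority says 1200, local clock says 1400

print(decide(offset, panic_s=1 * HOUR))                    # panic
print(decide(offset, panic_s=2.5 * HOUR))                  # step
print(decide(offset, panic_s=1 * HOUR, allow_panic=True))  # step
```

The three calls correspond to the three options in the message: keep the one-hour limit and panic, raise the limit above two hours, or forgive the first large offset.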
>>>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> Brian Utterback wrote:
>>>>>>>>
>>>>>>>>> But is this a valid characterization? And even if it is mostly
>>>>>>>>> true, what harm is there in waiting to see if it gets better? I
>>>>>>>>> think Judah has the right idea, namely if the going gets tough,
>>>>>>>>> just sit down, shut up and pretend that you don't exist until
>>>>>>>>> things get better. That is, go ahead and yell, don't step the
>>>>>>>>> clock but don't serve time in case you might be off, but be
>>>>>>>>> willing to start up again if things get better later. This seems
>>>>>>>>> like the best of both worlds.
>>>>>>>>>
>>>>>>>>> David L. Mills wrote:
>>>>>>>>>
>>>>>>>>>> The philosophical basis of this design is very carefully 
>>>>>>>>>> considered in the book. However, the simple characterization 
>>>>>>>>>> of the panic threshold is that if exceeded, it will not get 
>>>>>>>>>> better no matter how long you wait.