[ntp:hackers] What to do when the offset is WAYTOOBIG
David L. Mills
mills at udel.edu
Fri Apr 20 14:03:32 PDT 2007
Brian and Harlan,
The alleged configuration rewrite tree is on pogo.udel.edu under user
name kamboj, new-config-bk-repo. The file modification dates reveal
somebody working on it recently, like 20 April. Could be Sachin.
Dave
Brian Utterback wrote:
> This issue is a show stopper for me regarding the integration of
> NTPv4 into Solaris Nevada. I cannot guarantee to make progress since
> it wouldn't be part of my day job, but if you can point me in the
> right direction, I could take a stab at it. Where is the new code?
> And, Harlan, the new command line code requires autogen to modify
> options, am I right? I think the comments in the README files are
> behind in the versions of autogen, autoconf, automake. What versions
> are required now?
>
> David L. Mills wrote:
>
>> Brian,
>>
>> So far as I know, nobody has volunteered to integrate the rewrite with
>> the current working code. It was all set to go when Harlan changed the
>> command line option code, which was not compatible with the rewrite.
>> This is not to say Harlan's change was good or bad, just that the
>> programmer assigned to the rewrite was assigned a new job before he
>> figured it out.
>>
>> Dave
>>
>> Brian Utterback wrote:
>>
>>> What you describe is exactly what I am looking for. All too often the
>>> program exits with only that single message in the syslog and is
>>> overlooked, and worse, not even available remotely.
>>>
>>> I am concerned by your use of the phrase "dead in the water
>>> configuration rewrite code". During recent discussions, I was given
>>> to understand that the config rewrite code was likely to be integrated
>>> fairly soon. Is that no longer the case?
>>>
>>> David L. Mills wrote:
>>>
>>>> Most often, the operator is in an office down the corridor and can
>>>> ssh to the victim machine and fix the problem. A really neat way to
>>>> do this is using the remote configuration features of the now dead
>>>> in the water configuration rewrite code. This could be used both to
>>>> monitor operation in real time and to send authenticated
>>>> configuration commands in real time.
>>>>
>>>> While not volunteering to do this myself, I propose a new option
>>>> that, if present, avoids killing the process and instead calls a
>>>> designated program. Also, a trap should be provided to alert the
>>>> ntpq monitoring program which itself can call a designated program
>>>> if the performance is out of bounds.
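
A minimal sketch of the first half of this proposal (run a designated program instead of killing the process), not taken from ntpd. The function name `on_panic` and the parameter `handler_argv` are hypothetical, and passing the offset as the handler's last argument is an assumption made for illustration. The trap to ntpq is not covered.

```python
import subprocess
import sys


def on_panic(offset_s, handler_argv=None):
    """Hypothetical hook for an offset beyond the panic threshold.

    With no handler configured, mimic the current behavior and exit.
    With a handler, run it and keep the calling process alive.
    """
    if handler_argv is None:
        sys.exit(f"offset {offset_s:.3f} s exceeds panic threshold; exiting")
    # The offset is appended as the last argument (an assumption of this
    # sketch). The command is run directly, without a shell.
    result = subprocess.run(
        handler_argv + [f"{offset_s:.3f}"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout
```

The caller decides what to do with the handler's exit status; the sketch only shows that the process survives the event when a handler is configured.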
>>>>
>>>> All this of course should eventually be done using SNMP
>>>> infrastructure; however, upon closer inspection, SNMP is woefully
>>>> inadequate with the data types and trip wires commonly used by ntpd.
>>>>
>>>> Dave
>>>>
>>>> Brian Utterback wrote:
>>>>
>>>>> I agree with everything you said. As I said in my previous message,
>>>>> you have convinced me that trying to monkey with the clock selection
>>>>> is the wrong way to go, despite the fact that it has some nice
>>>>> properties. So I agree that the only thing left is the decision to
>>>>> step or fall on your sword. And as you note, the step threshold is
>>>>> configurable, so there is nothing to discuss if you want to step.
>>>>> However, I think that there is one more item on the agenda, namely
>>>>> whether or not it would be a good idea to fall on your sword or
>>>>> merely run and hide and call for help.
>>>>>
>>>>> So, my proposal is that instead of exiting with an error message,
>>>>> we do not step the clock, we do print an error message, and we
>>>>> mark the clock as insane (or otherwise stop sending out the time).
>>>>> We should ensure that there is a way to remotely set the allow_panic
>>>>> variable if the admin decides that is the way to fix things. I am
>>>>> happy to make the choice about whether or not to exit configurable
>>>>> as well.
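
A minimal sketch of the policy proposed above, not ntpd's implementation. The class `TimeDaemon`, its fields and the returned action strings are hypothetical; treating `allow_panic` as a one-shot permission that is cleared after use is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class TimeDaemon:
    panic_limit: float = 1000.0   # seconds
    allow_panic: bool = False     # settable later, e.g. by a remote admin
    exit_on_panic: bool = False   # the configurable choice mentioned above
    serving: bool = True          # False means "do not hand out time"
    log: list = field(default_factory=list)

    def handle_offset(self, offset: float) -> str:
        """Return the action taken for a measured offset in seconds."""
        if abs(offset) <= self.panic_limit:
            self.serving = True
            return "adjust"
        if self.allow_panic:
            self.allow_panic = False  # assumed one-shot
            self.serving = True
            return "step"
        self.log.append(f"offset {offset} s exceeds panic limit")
        if self.exit_on_panic:
            return "exit"
        # Stay alive, keep complaining, but stop serving time.
        self.serving = False
        return "hold"
```

In the "hold" state the process can still be queried remotely, and setting `allow_panic` lets the next oversized offset through.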
>>>>>
>>>>> So, Dave, I guess the question I have is this: is there a real
>>>>> technical reason why this is a bad idea? Is there a scenario where
>>>>> the behavior I am proposing behaves worse than the current behavior?
>>>>> We know that there are some common scenarios where it has more
>>>>> desirable behavior.
>>>>>
>>>>> David L. Mills wrote:
>>>>>
>>>>>> Brian,
>>>>>>
>>>>>> There are two separable issues to consider here, the clock
>>>>>> selection algorithm and the panic threshold.
>>>>>>
>>>>>> The clock selection algorithm was designed only after considerable
>>>>>> discussion, both in the commercial community (DEC) and in the
>>>>>> computer science theory community. As has been noted, it is based
>>>>>> on the model described in Keith Marzullo's dissertation and
>>>>>> discussions at the Dagstuhl Conference in southern Germany. I say
>>>>>> this to emphasize this model has been extensively vetted by a
>>>>>> bunch of guys I trust.
>>>>>>
>>>>>> The absolute bedrock method in the design is to find the best
>>>>>> majority subset of clocks that agree within some interval based on
>>>>>> delay. There is strong theoretical and practical evidence that the
>>>>>> true UTC is somewhere in the middle of what I call the
>>>>>> intersection interval developed by the algorithm. In my example,
>>>>>> the conditions are the same before the clock is set at 1200 and
>>>>>> after the clock has been set manually. The three 1200 clocks
>>>>>> remain the truechimers and the two 1300 clocks remain the
>>>>>> falsetickers. The fact the falsetickers are above or below the
>>>>>> panic threshold is not significant. The only time the panic
>>>>>> threshold is important is at the time the clock is to be set.
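
A minimal sketch of the interval-intersection idea described above, in the style of Marzullo's algorithm. It is not ntpd's code, and the selection algorithm NTP actually uses is a refinement that differs in detail. The function name `best_intersection` is hypothetical, and the assumption is that each clock contributes an interval (offset minus and plus an error bound).

```python
def best_intersection(intervals):
    """Return (count, lo, hi): the largest number of overlapping intervals
    and the first region in which that many overlap.

    intervals: list of (lo, hi) pairs, one per clock, with lo <= hi.
    """
    # Tag each endpoint: -1 opens an interval, +1 closes it. Sorting on
    # (value, tag) puts an opening before a closing at the same value, so
    # intervals that merely touch still count as overlapping.
    edges = []
    for lo, hi in intervals:
        edges.append((lo, -1))
        edges.append((hi, +1))
    edges.sort()

    best = count = 0
    best_lo = best_hi = None
    for i, (value, tag) in enumerate(edges):
        count -= tag  # an opening raises the count, a closing lowers it
        if count > best:
            best = count
            best_lo = value
            best_hi = edges[i + 1][0]  # the next edge bounds this region
    return best, best_lo, best_hi


# Minutes since midnight: three clocks near 1200 (720), two near 1300 (780).
clocks = [(715, 725), (716, 726), (714, 724), (775, 785), (776, 786)]
print(best_intersection(clocks))  # (3, 716, 724)
```

Three of the five intervals overlap between 716 and 724, so those three clocks form the majority and the other two are set aside, regardless of where the local clock currently stands.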
>>>>>>
>>>>>> Notwithstanding the above, the important issue is whether to step
>>>>>> the clock, wait for better times or call for (presumably) human
>>>>>> intervention. I purposely chose a scenario where it was necessary
>>>>>> to choose between two alternative cliques, the members of which
>>>>>> were close together while the cliques themselves were far apart.
>>>>>> It could be the nearest clique is within the panic threshold, but
>>>>>> the selection algorithm considers that clique falseticker by the
>>>>>> rules. Thus, the only thing remaining is whether to step to the
>>>>>> truechimer clique or panic.
>>>>>>
>>>>>> In my example there is no credible way the two cliques come
>>>>>> together left by themselves unless one of them is manually
>>>>>> stepped. Should either clique be rescued, it joins the other
>>>>>> clique and things get well. So, the only question remaining is
>>>>>> when the panic threshold is exceeded, whether to fall on your
>>>>>> sword or step. I submit this is a nonstarter to argue. You get to
>>>>>> set the panic interval anywhere you like, perhaps 30 years might
>>>>>> be appropriate.
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>> Brian Utterback wrote:
>>>>>>
>>>>>>> Actually, your scenario is a good reason why it may not be a good
>>>>>>> idea to mark clocks that are outside the limit as insane and ignore
>>>>>>> them. If we were to ignore the three that say 1200, then we would
>>>>>>> only have the two that say 1300, right at the limit. So we step the
>>>>>>> clock to 1300. Now we have all five available and since three say
>>>>>>> 1200, we step the clock to 1200, effectively circumventing the
>>>>>>> panic limit.
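
The circumvention can be traced with a small simulation. This is a sketch under stated assumptions, not ntpd behavior: the scenario is read as three servers at 1200, two at 1300, a local clock at 1400 and a panic limit of one hour, and the hypothetical function `step_ignoring_outliers` simply follows the most common reading among servers within the limit.

```python
from collections import Counter

HOUR = 3600  # seconds


def step_ignoring_outliers(local, servers, panic_limit):
    """One round: drop servers beyond the panic limit, then follow the
    most common remaining reading. Returns the new local time."""
    usable = [s for s in servers if abs(s - local) <= panic_limit]
    if not usable:
        return local
    value, _ = Counter(usable).most_common(1)[0]
    return value


servers = [12 * HOUR] * 3 + [13 * HOUR] * 2
local = 14 * HOUR
history = [local]
for _ in range(2):
    local = step_ignoring_outliers(local, servers, HOUR)
    history.append(local)

print([t // HOUR for t in history])  # [14, 13, 12]
```

Each individual step stays within the one-hour limit, yet the clock ends up two hours from where it started.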
>>>>>>>
>>>>>>> All I am saying is that if you exit there is only one recourse, to
>>>>>>> manually restart. The problem could be permanent, the problem could
>>>>>>> be transient. In either case, somebody needs to log on the system
>>>>>>> and restart the daemon.
>>>>>>>
>>>>>>> On the other hand, if you instead stop serving time but don't exit,
>>>>>>> then if the problem is transient then no intervention is required.
>>>>>>> If it is permanent but fixable upstream, again no intervention is
>>>>>>> required. If it is permanent and local (I'm thinking somebody set
>>>>>>> the local clock) then it might be fixed by resetting allow_panic
>>>>>>> (can that be done remotely? With the new config stuff?). And
>>>>>>> finally, it might still require a local login, but that would have
>>>>>>> happened either way.
>>>>>>>
>>>>>>> No matter how I slice it, it seems better to me to stay alive and
>>>>>>> hopeful even if those hopes are dashed, than to commit suicide. If
>>>>>>> you stop serving the time downstream, then the effect on the NTP
>>>>>>> network is the same either way, but by staying alive you can allow
>>>>>>> remote diagnosis and keep calling for help periodically.
>>>>>>>
>>>>>>> David L. Mills wrote:
>>>>>>>
>>>>>>>> Brian,
>>>>>>>>
>>>>>>>> I am watching five clocks. Three of them say 1200, two say 1300
>>>>>>>> and my clock says 1400. Since the majority of clocks I watch say
>>>>>>>> 1200, I conclude the real time is 1200, but that is beyond my
>>>>>>>> panic limit of one hour. Should I wait until things "get
>>>>>>>> better"? I think not. I could make the panic limit over two
>>>>>>>> hours and things would get better real quick. Or, I could use
>>>>>>>> the -g option, so the first panic would be forgiven and my clock
>>>>>>>> would read 1200. If after that a warp occurs over 1000 s
>>>>>>>> relative to the majority clique, there may be a stuck bit in the
>>>>>>>> hardware clock (that's happened) and I need to jump the train
>>>>>>>> right away.
>>>>>>>>
>>>>>>>> Dave
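
The five-clock example can be worked through in a few lines. This is a simplified sketch, not ntpd's logic: readings are exact values in seconds since midnight, the majority is a plain vote, and the names `decide` and `first_panic_forgiven` are hypothetical stand-ins for the panic check and for the option that forgives the first panic.

```python
from collections import Counter

HOUR = 3600  # seconds


def decide(local, servers, panic_limit, first_panic_forgiven=False):
    """Return (majority_time, action) for one set of readings."""
    value, votes = Counter(servers).most_common(1)[0]
    if votes * 2 <= len(servers):
        return None, "no majority"
    if abs(value - local) <= panic_limit:
        return value, "set clock"
    if first_panic_forgiven:
        return value, "set clock (first panic forgiven)"
    return value, "panic"


servers = [12 * HOUR] * 3 + [13 * HOUR] * 2
local = 14 * HOUR

print(decide(local, servers, panic_limit=1 * HOUR))
# (43200, 'panic')
print(decide(local, servers, panic_limit=2 * HOUR + 1))
# (43200, 'set clock')
print(decide(local, servers, panic_limit=1 * HOUR, first_panic_forgiven=True))
# (43200, 'set clock (first panic forgiven)')
```

In all three cases the majority answer is the same (43200 s, that is 1200); only the reaction to the two-hour offset changes.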
>>>>>>>>
>>>>>>>> Brian Utterback wrote:
>>>>>>>>
>>>>>>>>> But is this a valid characterization? And even if it is mostly
>>>>>>>>> true, what harm is there in waiting to see if it gets better? I
>>>>>>>>> think Judah has the right idea, namely if the going gets tough,
>>>>>>>>> just sit down, shut up and pretend that you don't exist until
>>>>>>>>> things get better. That is, go ahead and yell, don't step the
>>>>>>>>> clock but don't serve time in case you might be off, but be
>>>>>>>>> willing to start up again if things get better later. This seems
>>>>>>>>> like the best of both worlds.
>>>>>>>>>
>>>>>>>>> David L. Mills wrote:
>>>>>>>>>
>>>>>>>>>> The philosophical basis of this design is very carefully
>>>>>>>>>> considered in the book. However, the simple characterization
>>>>>>>>>> of the panic threshold is that if exceeded, it will not get
>>>>>>>>>> better no matter how long you wait.
>>>>>>>>>
>> _______________________________________________
>> hackers mailing list
>> hackers at support.ntp.org
>> https://support.ntp.org/mailman/listinfo/hackers