[ntp:hackers] Does ntpd need to whine more ?

David L. Mills mills at udel.edu
Mon Oct 3 16:08:20 UTC 2005


P-H,

Sorry; that last got away from me.

We are talking past each other. The issue of poll interval is related to 
time constant; indeed, that is the fundamental scaling assumption. The 
issue of how to recover after a lengthy outage is different. The 
disciplined oscillator does not care if updates occur more often than 
expected; that is oversampling. It does care very much about 
undersampling, but as long as one update in eight polls is delivered 
from the clock filter, the transient response is preserved.

The design intent is that the best recovery after an outage is using 
iburst mode, so it takes no more than 16 s to refresh the clock filter 
in that case, once the first response from the server arrives. Due to 
backoff, this might not happen until after 1024 s in the default case. 
It does not seem prudent to get more aggressive than this; otherwise 
something like the Wisconsin incident might happen again.

Your comment about rapid temperature excursions leading to steps is very 
relevant. I don't see that here in room temperature controlled 
environments, but laptops could be another story. As in other things, 
there is a set of competing compromises with escalating complexity. The 
worst case is the clock filter discarding all but one sample every eight 
polls and a second-order frequency change of several milliseconds per 
minute squared. In principle this can be controlled by the hysteresis 
limits, now +-30. The limits could even be adaptive, but that adds in 
ever more complexity and fragility.

I don't understand your scenarios with cold rock behavior after a 
lengthy outage. My experience here with typical systems is within 50 ms 
after 36 h between ACTS updates and 10 ms after several hours of WWV 
signal loss, and this with cold rock frequency compensation up to a 
couple hundred PPM. I've run the ACTS and WWV drivers for several months 
without ever stepping.
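
As a rough consistency check on the 36 h figure, the sketch below converts an offset accumulated during an outage into an average residual frequency error. It assumes the error grows linearly (a constant residual frequency error), which is a simplification. The function names are hypothetical, and the 128 ms limit is ntpd's default step threshold, used here only as an illustrative assumption.

```python
def mean_frequency_error_ppm(offset_s, holdover_s):
    """Average frequency error (PPM) implied by an offset accumulated
    linearly over a holdover interval. Both arguments are in seconds."""
    return offset_s / holdover_s * 1e6


def holdover_until_offset_s(limit_s, freq_error_ppm):
    """Seconds of free-running needed to accumulate limit_s of offset
    at a constant frequency error given in PPM."""
    return limit_s / (freq_error_ppm * 1e-6)


if __name__ == "__main__":
    # 50 ms after 36 h between updates, the figure quoted above.
    ppm = mean_frequency_error_ppm(0.050, 36 * 3600)
    print(f"implied residual error: {ppm:.2f} PPM")

    # Assumed 128 ms step threshold and an assumed 1 PPM residual error.
    hours = holdover_until_offset_s(0.128, 1.0) / 3600
    print(f"time to reach 128 ms at 1 PPM: {hours:.1f} h")
```

Under these assumptions, 50 ms in 36 h corresponds to a residual error of roughly 0.4 PPM, far below the raw uncompensated frequency offset of a couple hundred PPM.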

As for outage notification, the daemon does log reachability events now. 
It could be that it should do this every hour or something like that; I 
have no problem with that. The NIST folks use the filegen facility and 
call the helpdesk beeper if something goes wrong in any of their servers.

Dave

Poul-Henning Kamp wrote:

> In message <43413DB6.308 at udel.edu>, "David L. Mills" writes:
>
>> There is a fundamental misunderstanding here.
>
>
> Agreed, but we may not agree what the misunderstanding is.
>
>> There is a fundamental misunderstanding here. The clock discipline is in
>> fact a flywheel which is nudged at each poll update to correct the time
>> and update the frequency estimate. If you stop nudging it for awhile it
>> may accumulate error, but not much. How long should you wait before
>> declaring unsynchronized?
>
>
> I don't think it is unreasonable to expect people to have a plain
> XO (unless they tell NTPD otherwise) and therefore few systems
> actually have a recoverable offset after one day on the island.
>
> And expecting to recapture with a poll of 1024 after free-wheeling
> for a day is waaaay more optimistic than 25 cent XO's deserve.
>
> I would say that once the shift register runs dry, we should
> reduce the poll interval (if minpoll allows) for every empty shift
> register we see:
>
> That way you should have a scenario like:
>
> 0 poll = 1024 shift=11111111
> 1024 poll = 1024 shift=11111110
> 2048 poll = 1024 shift=11111100
> 3072 poll = 1024 shift=11111000
> 4096 poll = 1024 shift=11110000
> 5120 poll = 1024 shift=11100000
> 6144 poll = 1024 shift=11000000
> 7168 poll = 1024 shift=10000000
> 8192 poll = 1024 shift=00000000, reduce poll, start timer 512 * 8
> 12288 poll = 512 shift=00000000, reduce poll, start timer 256 * 8
> 14336 poll = 256 shift=00000000, reduce poll, start timer 128 * 8
> 15360 poll = 128 shift=00000000, reduce poll, start timer 64 * 8
> 15872 poll = 64 shift=00000000 at minpoll, do nothing
>
> That way we are back to a 64 s poll interval after 4h24m and that sounds
> very compatible with typical XO performance.
>
> In general the majority of NTPD synchronized machines suffer from
> diurnal wobble, so even 12 hours wouldn't be unreasonable.
>
>> The clock discipline algorithm is very good at estimating the optimum
>> time constant.
>
>
> Actually it isn't.
>
> It is far too eager to wander up to 1024 and due to the time delay
> of the shift register it takes ages for it to find out it got too
> far and it usually ends up stepping to get back in sync.
>
> The worst case situation is actually incredibly common: You wander
> up to 1024, and temperature changes, so your offset grows. The
> shift register fills up with monotonically increasing offsets and
> we get a systematic delay of [3...4] x 1024 seconds before the PLL
> ever hears about the existence of the offset.
>
> Iburst mode is certainly a big improvement but not very widely used
> yet.
>
>> You are invited to concoct
>> counterexamples, but I will believe them only if confirmed by actual
>> scenarios in vivo or better yet in simulation.
>
>
> My first and primary beef is that we do not whine loudly when we
> have lost reachability, no matter how long this has been going on.
>
> Can't we at least agree that after being unreachable for N hours
> we should syslog something rather severe ?
>
> I'd propose 24 for N, but even 168 will improve on the current
> situation where people have no inkling that their system has
> wandered off into the sunset.
>
>> The local clock is a terrible idea, unless for the only purpose to
>> wrangle a herd to a common timescale in response to a loss of outside
>> synchronization.
>
>
> Agreed. I believe some OS bogusly ships with a stratum 11 localclock
> and whoever decided that should be forced to polish the hands of
> Big Ben until he or it wears out.
>
> But in this case, localclock only obscures the problem, it is not
> the basic problem.
>


