[ntp:hackers] Does ntpd need to whine more ?

David L. Mills mills at udel.edu
Tue Oct 4 15:49:21 UTC 2005


About islands, you and I are on different ones in different oceans.

I have had and do have configurations where the interval between updates 
is over a day to several days. Once the frequency has stabilized, I 
expect it to average out over the day no worse than 1 PPM, which results 
in a residual error no worse than about 86 ms over the day, almost always much 
less than that. This is under normal heating and cooling home/office 
conditions. That's what I designed the poll-adjust routine for.
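
As a rough check of the arithmetic above: a constant frequency error of 1 PPM accumulates one microsecond of offset per second, so over a full day it comes to roughly 86 ms, the same order as the bound quoted. The sketch below only illustrates that calculation; the function name is an assumption made for this example and is not part of ntpd.

```python
SECONDS_PER_DAY = 86_400


def accumulated_offset_ms(freq_error_ppm: float, interval_s: float) -> float:
    """Offset in milliseconds built up by a constant frequency error.

    freq_error_ppm is the frequency error in parts per million and
    interval_s is how long the clock free-runs with that error.
    """
    return freq_error_ppm * 1e-6 * interval_s * 1e3


if __name__ == "__main__":
    # Worst case assumed in the text: 1 PPM held for a whole day.
    print(f"{accumulated_offset_ms(1.0, SECONDS_PER_DAY):.1f} ms")
```

In practice the frequency error wanders in both directions over the day, which is why the observed residual is usually much smaller than this worst case.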

We have in the department and on my wires about two dozen NTP servers 
with a total of probably several thousand clients. So far as I know from 
log sampling the only time any of them have taken a step is after 
reboot. The experiments I have done with WWV and ACTS involved several 
different machines quite charitably considered junkbox at best. Only the 
Alphae have anything other than a commodity oscillator. I have graphs of 
free-run frequency versus time for quite a collection of junkbox 
machines and they vary typically over a band less than one PPM over a 
day, again under home/office conditions. Once continuously disciplined 
by NTP, I expect the performance claimed above.

I fully understand in the jungles of Borneo and the Arctic tundra such 
assumptions are invalid. To those operators I suggest reducing the 
maxpoll to less than 1024 s. As I said in my previous message, a more responsive 
behavior can easily be engineered by judicious choice of parameters, and 
that can be done on a per-scenario basis. I am reluctant to do this for 
the general population, as it invites a vulnerability for network 
traffic. There may in fact be merit in dynamic adjustment of those 
parameters, but I am on other missions just now. You will not like my 
proposed approach using the simulator and crafted signal generators.
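
For operators acting on the maxpoll suggestion: in ntp.conf the poll limits are given as powers of two, so 1024 s corresponds to maxpoll 10 and anything shorter means maxpoll 9 or lower. The conversion can be sketched as follows; the helper name is an assumption for illustration only.

```python
def poll_exponent(seconds: int) -> int:
    """Largest exponent n such that 2**n <= seconds.

    ntp.conf expresses minpoll and maxpoll as this exponent,
    not as a number of seconds.
    """
    if seconds < 1:
        raise ValueError("poll interval must be at least 1 second")
    return seconds.bit_length() - 1


if __name__ == "__main__":
    for interval in (1024, 512, 256, 64):
        print(f"{interval:5d} s -> poll exponent {poll_exponent(interval)}")
```

A server line capped at 512 s would then carry `maxpoll 9`.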

From my experience you need to carefully separate normal versus 
anomalistic behavior. Sometimes during experiments with frequent 
restarts and miscellaneous error conditions a significant frequency 
torque results. I'm not surprised, as the initial conditions can be 
quite chaotic. Once the daemon has run awhile and without disturbance, 
performance resumes as above.

Apparently, my message on poll interval engineering has not been 
received. I do not want to reduce the poll interval once the shift 
register clears. In fact, I want to increase it to reduce network load. 
Note very carefully the loop time constant is not increased and, once 
reachability is restored, the poll interval resumes where it left off. 
Therefore, it does no good with respect to frequency correction to 
reduce the poll interval below this; doing so simply oversamples with 
little effect on frequency.

You may argue, notwithstanding the polling issue, that the time constant 
should be reduced after a long hiatus. That could be very expensive for 
a modem service, for example. The right way to do this is to adjust the 
parameters of the poll-adjust algorithm as previously described. I'd 
rather have a dialog on time constant; the poll interval issue is 
secondary to that.

Also, if the frequency has really been torqued, start the recovery in 
state 3, where the discipline computes the frequency adjustment 
directly. To evaluate that approach, disable the kernel, remove the 
frequency file, introduce a serious frequency error with ntptime and 
start the daemon. After 900 s the frequency should be set within 1 PPM, 
possibly accompanied by a step.
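
To give a feel for what a direct frequency computation does with a badly torqued clock, here is a rough sketch. It is not the ntpd code: the function name and the sample numbers are assumptions for illustration. The idea shown is only that the frequency error can be estimated as the change in offset divided by the elapsed time.

```python
def direct_frequency_ppm(offset_start_s: float,
                         offset_end_s: float,
                         elapsed_s: float) -> float:
    """Estimate frequency error in PPM from two offset measurements."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return (offset_end_s - offset_start_s) / elapsed_s * 1e6


if __name__ == "__main__":
    # Hypothetical scenario: the clock has been set 100 PPM off
    # and is observed over the 900 s interval mentioned in the text.
    true_error_ppm = 100.0
    elapsed = 900.0
    offset_start = 0.002  # assumed initial offset in seconds
    offset_end = offset_start + true_error_ppm * 1e-6 * elapsed

    estimate = direct_frequency_ppm(offset_start, offset_end, elapsed)
    print(f"estimated frequency error: {estimate:.1f} PPM")
```

With real measurements the two offsets carry network jitter, so the estimate is only as good as the offset samples at each end of the interval.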


Poul-Henning Kamp wrote:

>In message <43413DB6.308 at udel.edu>, "David L. Mills" writes:
>>There is a fundamental misunderstanding here.
>Agreed, but we may not agree what the misunderstanding is.
>>There is a fundamental misunderstanding here. The clock discipline is in 
>>fact a flywheel which is nudged at each poll update to correct the time 
>>and update the frequency estimate. If you stop nudging it for awhile it 
>>may accumulate error, but not much. How long should you wait before 
>>declaring unsynchronized?
>I don't think it is unreasonable to expect people to have a plain
>XO (unless they tell NTPD otherwise) and therefore few systems
>actually have a recoverable offset after one day on the island.
>And expecting to recapture with a poll of 1024 after free-wheeling
>for a day is waaaay more optimistic than 25 cent XO's deserve.
>I would say that once the shift register runs dry, we should
>reduce the poll interval (if minpoll allows) for every empty shift
>register we see:
>That way you should have a scenario like:
>0	poll = 1024 shift=11111111
>1024	poll = 1024 shift=11111110
>2048	poll = 1024 shift=11111100
>3072	poll = 1024 shift=11111000
>4096	poll = 1024 shift=11110000
>5120	poll = 1024 shift=11100000
>6144	poll = 1024 shift=11000000
>7168	poll = 1024 shift=10000000
>8192	poll = 1024 shift=00000000, reduce poll, start timer 512 * 8
>12288	poll = 512  shift=00000000, reduce poll, start timer 256 * 8
>14336	poll = 256  shift=00000000, reduce poll, start timer 128 * 8
>15360	poll = 128  shift=00000000, reduce poll, start timer 64 * 8
>15872	poll = 64   shift=00000000  at minpoll, do nothing
>That way we are back to a 64s poll interval after 4h24m and that sounds
>very compatible with typical XO performance.
>In general the majority of NTPD synchronized machines suffer from
>diurnal wobble, so even 12 hours wouldn't be unreasonable.
>>The clock discipline algorithm is very good at estimating the optimum 
>>time constant.
>Actually it isn't.
>It is far too eager to wander up to 1024 and due to the time delay
>of the shift register it takes ages for it to find out it got too
>far and it usually ends up stepping to get back in sync.
>The worst case situation is actually incredibly common:  You wander
>up to 1024, and temperature changes, so your offset grows.  The
>shift register fills up with monotonically increasing offsets and
>we get a systematic delay of [3...4] x 1024 seconds before the PLL
>ever hears about the existence of the offset.
>Iburst mode is certainly a big improvement but not very widely used.
>>You are invited to concoct 
>>counterexamples, but I will believe them only if confirmed by actual 
>>scenarios in vivo or better yet in simulation.
>My first and primary beef is that we do not whine loudly when we
>have lost reachability, no matter how long this has been going on.
>Can't we at least agree that after being unreachable for N hours
>we should syslog something rather severe ?
>I'd propose 24 for N, but even 168 will improve on the current
>situation where people have no inkling that their system has
>wandered off into the sunset.
>>The local clock is a terrible idea, unless for the only purpose to 
>>wrangle a herd to a common timescale in response to a loss of outside 
>Agreed.  I believe some OS bogusly ships with a stratum 11 localclock
>and whoever decided that should be forced to polish the hands of
>Big Ben until he or it wears out.
>But in this case, localclock only obscures the problem, it is not
>the basic problem.
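
The back-off schedule proposed in the quoted message can be reproduced with a short simulation. This is a sketch of the quoted proposal only, not of anything ntpd does; the function name and its parameters are assumptions for illustration. It drains an 8-bit reachability register at the 1024 s poll, then halves the poll each time the register has stayed empty for eight polls at the new interval, stopping at the 64 s minimum.

```python
def backoff_schedule(maxpoll_s: int = 1024,
                     minpoll_s: int = 64,
                     register_bits: int = 8):
    """Return (time_s, poll_s, shift_register) rows for the proposal."""
    rows = []
    t = 0
    poll = maxpoll_s

    # Phase 1: reachability is lost and the register drains one bit per poll.
    for missed in range(register_bits + 1):
        register = "1" * (register_bits - missed) + "0" * missed
        rows.append((t, poll, register))
        if missed < register_bits:
            t += poll

    # Phase 2: the register is empty; halve the poll and wait
    # register_bits polls at the new interval before halving again.
    while poll > minpoll_s:
        poll //= 2
        t += poll * register_bits
        rows.append((t, poll, "0" * register_bits))

    return rows


if __name__ == "__main__":
    schedule = backoff_schedule()
    for time_s, poll_s, register in schedule:
        print(f"{time_s:6d}  poll = {poll_s:4d}  shift={register}")

    hours, remainder = divmod(schedule[-1][0], 3600)
    print(f"minimum poll reached after {hours}h{remainder // 60}m")
```

The final row lands at 15872 s, which matches the 4h24m figure in the quoted text.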
