[ntp:hackers] Does ntpd need to whine more ?
David L. Mills
mills at udel.edu
Tue Oct 4 15:49:21 UTC 2005
About islands, you and I are on different ones in different oceans.
I have had and do have configurations where the interval between updates
is over a day to several days. Once the frequency has stabilized, I
expect it to average out over the day no worse than 1 PPM, which results
in a residual error no worse than about 86 ms over the day, almost always much
less than that. This is under normal heating and cooling home/office
conditions. That's what I designed the poll-adjust routine for.
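
The arithmetic behind that bound can be checked with a few lines of
Python. This is only a sketch: the function name is made up for the
illustration, and it assumes the frequency error stays constant over the
whole interval, which is the worst case for a 1 PPM average.

```python
SECONDS_PER_DAY = 86_400


def residual_error_ms(freq_error_ppm: float, interval_s: float) -> float:
    """Time error in ms accumulated by a constant frequency error.

    freq_error_ppm is in parts per million; interval_s is in seconds.
    """
    return freq_error_ppm * 1e-6 * interval_s * 1e3


# A constant 1 PPM error held for a full day comes to roughly 86 ms.
print(f"{residual_error_ms(1.0, SECONDS_PER_DAY):.1f} ms")
```

Since the real frequency error wanders around its mean instead of
sitting at the limit, the accumulated error is normally well below this
figure.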
We have in the department and on my wires about two dozen NTP servers
with a total of probably several thousand clients. So far as I know from
log sampling the only time any of them have taken a step is after
reboot. The experiments I have done with WWV and ACTS involved several
different machines quite charitably considered junkbox at best. Only the
Alphae have anything other than a commodity oscillator. I have graphs of
free-run frequency versus time for quite a collection of junkbox
machines and they vary typically over a band less than one PPM over a
day, again under home/office conditions. Once continuously disciplined
by NTP, I expect the performance claimed above.
I fully understand in the jungles of Borneo and the Arctic tundra such
assumptions are invalid. To those operators I suggest reducing the
maxpoll to less than 1024 s. As I said in my previous message, a more responsive
behavior can easily be engineered by judicious choice of parameters, and
that can be done on a per-scenario basis. I am reluctant to do this for
the general population, as it invites a vulnerability for network
traffic. There may in fact be merit in dynamic adjustment of those
parameters, but I am on other missions just now. You will not like my
proposed approach using the simulator and crafted signal generators.
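
To make the maxpoll trade-off concrete: ntpd expresses poll intervals as
powers of two, so maxpoll 10 is 1024 s and maxpoll 9 is 512 s. The
sketch below tabulates how much time error builds up between updates for
a given frequency wander. The 5 PPM figure is purely an assumed value
for a harsh environment, not a measurement, and the helper names are
hypothetical.

```python
def poll_seconds(exponent: int) -> int:
    """Poll interval in seconds for a minpoll/maxpoll exponent."""
    return 2 ** exponent


def drift_between_polls_ms(exponent: int, wander_ppm: float) -> float:
    """Time error in ms accumulated over one poll interval."""
    return wander_ppm * 1e-6 * poll_seconds(exponent) * 1e3


ASSUMED_WANDER_PPM = 5.0  # assumption for illustration only

for exponent in (10, 9, 8, 7, 6):
    print(
        f"maxpoll {exponent:2d} = {poll_seconds(exponent):4d} s, "
        f"drift per poll about "
        f"{drift_between_polls_ms(exponent, ASSUMED_WANDER_PPM):.2f} ms"
    )
```

Each step down in maxpoll halves the error accumulated between updates
and doubles the number of packets sent, which is the network cost
mentioned above.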
From my experience you need to carefully separate normal versus
anomalistic behavior. Sometimes during experiments with frequent
restarts and miscellaneous error conditions a significant frequency
torque results. I'm not surprised, as the initial conditions can be
quite chaotic. Once the daemon has run awhile and without disturbance,
performance resumes as above.
Apparently, my message on poll interval engineering has not been
received. I do not want to reduce the poll interval once the shift
register clears. In fact, I want to increase it to reduce network load.
Note very carefully the loop time constant is not increased and, once
reachability is restored, the poll interval resumes where it left off.
Therefore, it does no good with respect to frequency correction to
reduce the poll interval below this; doing so simply oversamples with
little effect on frequency.
You may argue, notwithstanding the polling issue, that the time constant
should be reduced after a long hiatus. That could be very expensive for
a modem service, for example. The right way to do this is to adjust the
parameters of the poll-adjust algorithm as previously described. I'd
rather have a dialog on time constant; the poll interval issue is
secondary to that.
Also, if the frequency has really been torqued, start the recovery in
state 3, where the discipline computes the frequency adjustment
directly. To evaluate that approach, disable the kernel, remove the
frequency file, introduce a serious frequency error with ntptime and
start the daemon. After 900 s the frequency should be set within 1 PPM,
possibly accompanied by a step.
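
The idea of computing the frequency directly can be sketched as a
simple difference quotient: the change in measured offset divided by the
elapsed time. This is not the ntpd code, only an illustration of the
principle; the function names, the sign convention and the 100 PPM
injected error are all assumptions.

```python
def direct_frequency_ppm(offset_start_s: float, offset_end_s: float,
                         elapsed_s: float) -> float:
    """Frequency error in PPM from two offset measurements."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (offset_end_s - offset_start_s) / elapsed_s * 1e6


def offset_error_budget_ms(target_ppm: float, elapsed_s: float) -> float:
    """Offset measurement error (ms) that maps to target_ppm."""
    return target_ppm * 1e-6 * elapsed_s * 1e3


# Hypothetical experiment: the clock runs 100 PPM fast for 900 s.
injected_ppm = 100.0
elapsed_s = 900.0
offset_end_s = injected_ppm * 1e-6 * elapsed_s  # about 90 ms of drift

estimate = direct_frequency_ppm(0.0, offset_end_s, elapsed_s)
print(f"estimated frequency error: {estimate:.1f} PPM")
print(f"offset error budget for 1 PPM: "
      f"{offset_error_budget_ms(1.0, elapsed_s):.1f} ms")
```

Over a 900 s baseline the offsets have to be good to a bit under a
millisecond for the estimate to land within 1 PPM.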
Poul-Henning Kamp wrote:
>In message <43413DB6.308 at udel.edu>, "David L. Mills" writes:
>>There is a fundamental misunderstanding here.
>Agreed, but we may not agree what the misunderstanding is.
>>There is a fundamental misunderstanding here. The clock discipline is in
>>fact a flywheel which is nudged at each poll update to correct the time
>>and update the frequency estimate. If you stop nudging it for awhile it
>>may accumulate error, but not much. How long should you wait before
>I don't think it is unreasonable to expect people to have a plain
>XO (unless they tell NTPD otherwise) and therefore few systems
>actually have a recoverable offset after one day on the island.
>And expecting to recapture with a poll of 1024 after free-wheeling
>for a day is waaaay more optimistic than 25 cent XO's deserve.
>I would say that once the shift register runs dry, we should
>reduce the poll interval (if minpoll allows) for every empty shift
>register we see:
>That way you should have a scenario like:
>    0 poll = 1024 shift=11111111
> 1024 poll = 1024 shift=11111110
> 2048 poll = 1024 shift=11111100
> 3072 poll = 1024 shift=11111000
> 4096 poll = 1024 shift=11110000
> 5120 poll = 1024 shift=11100000
> 6144 poll = 1024 shift=11000000
> 7168 poll = 1024 shift=10000000
> 8192 poll = 1024 shift=00000000, reduce poll, start timer 512 * 8
>12288 poll =  512 shift=00000000, reduce poll, start timer 256 * 8
>14336 poll =  256 shift=00000000, reduce poll, start timer 128 * 8
>15360 poll =  128 shift=00000000, reduce poll, start timer 64 * 8
>15872 poll =   64 shift=00000000 at minpoll, do nothing
>That way we are back to a 64 s poll interval after 4h24m, and that sounds
>very compatible with typical XO performance.
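
The timeline in the proposal quoted above can be reproduced with a short
simulation. This is a sketch of the proposed back-off only, not of
anything ntpd currently does; the function name and its defaults are
chosen for the illustration.

```python
def backoff_schedule(maxpoll_s: int = 1024, minpoll_s: int = 64,
                     reach_bits: int = 8):
    """Return (time_s, poll_s, reach_register) rows for the proposal."""
    mask = (1 << reach_bits) - 1
    t, poll, reach = 0, maxpoll_s, mask
    rows = [(t, poll, reach)]

    # Phase 1: every missed poll shifts a zero into the register.
    while reach:
        t += poll
        reach = (reach << 1) & mask
        rows.append((t, poll, reach))

    # Phase 2: halve the poll interval, then wait for another
    # register's worth of missed polls before halving again.
    while poll > minpoll_s:
        poll //= 2
        t += poll * reach_bits
        rows.append((t, poll, reach))
    return rows


for t, poll, reach in backoff_schedule():
    print(f"{t:>5} poll = {poll:>4} shift={reach:0{8}b}")

hours, rem = divmod(backoff_schedule()[-1][0], 3600)
print(f"back at minpoll after {hours}h{rem // 60}m")
```

With the default values the last row falls at 15872 s, which matches the
4h24m quoted above.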
>In general the majority of NTPD synchronized machines suffer from
>diurnal wobble, so even 12 hours wouldn't be unreasonable.
>>The clock discipline algorithm is very good at estimating the optimum
>Actually it isn't.
>It is far too eager to wander up to 1024 and due to the time delay
>of the shift register it takes ages for it to find out it got too
>far and it usually ends up stepping to get back in sync.
>The worst case situation is actually incredibly common: You wander
>up to 1024, and temperature changes, so your offset grows. The
>shift register fills up with monotonically increasing offsets and
>we get a systematic delay of [3...4] x 1024 seconds before the PLL
>ever hears about the existence of the offset.
>Iburst mode is certainly a big improvement but not very widely used.
>>You are invited to concoct
>>counterexamples, but I will believe them only if confirmed by actual
>>scenarios in vivo or better yet in simulation.
>My first and primary beef is that we do not whine loudly when we
>have lost reachability, no matter how long this has been going on.
>Can't we at least agree that after being unreachable for N hours
>we should syslog something rather severe ?
>I'd propose 24 for N, but even 168 will improve on the current
>situation where people have no inkling that their system has
>wandered off into the sunset.
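
The reachability complaint proposed above amounts to a threshold test
on the time since the last good reply. A minimal sketch follows, using
Python's logging module as a stand-in for syslog; the function name, the
logger name and the message text are hypothetical, and only the 24 hour
default comes from the proposal.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("reach-sketch")  # hypothetical logger name


def should_whine(last_reply_s: float, now_s: float,
                 threshold_hours: float = 24.0) -> bool:
    """Log an error and return True once a peer has been silent too long."""
    silent_hours = (now_s - last_reply_s) / 3600.0
    if silent_hours >= threshold_hours:
        log.error("no server reachable for %.1f hours; "
                  "clock is free-running", silent_hours)
        return True
    return False


# Last reply at t = 0; check again 23 hours and 25 hours later.
print(should_whine(0.0, 23 * 3600.0))
print(should_whine(0.0, 25 * 3600.0))
```

A real daemon would also want to limit how often the message repeats,
which this sketch leaves out.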
>>The local clock is a terrible idea, unless for the only purpose to
>>wrangle a herd to a common timescale in response to a loss of outside
>Agreed. I believe some OS bogusly ships with a stratum 11 localclock
>and whoever decided that should be forced to polish the hands of
>Big Ben until he or it wears out.
>But in this case, localclock only obscures the problem, it is not
>the basic problem.