[ntp:hackers] Does ntpd need to whine more ?

Poul-Henning Kamp phk at phk.freebsd.dk
Mon Oct 3 14:47:34 UTC 2005


In message <43413DB6.308 at udel.edu>, "David L. Mills" writes:

>There is a fundamental misunderstanding here.

Agreed, but we may not agree what the misunderstanding is.

>There is a fundamental misunderstanding here. The clock discipline is in 
>fact a flywheel which is nudged at each poll update to correct the time 
>and update the frequency estimate. If you stop nudging it for awhile it 
>may accumulate error, but not much. How long should you wait before 
>declaring unsynchronized?

I don't think it is unreasonable to expect people to have a plain
XO (unless they tell NTPD otherwise) and therefore few systems
actually have a recoverable offset after one day on the island.

And expecting to recapture with a poll of 1024 after free-wheeling
for a day is waaaay more optimistic than 25 cent XO's deserve.
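
For a rough sense of scale, here is a sketch in Python of how much offset
accumulates while free-wheeling.  It assumes a constant residual frequency
error, which a real XO will not have since its error moves with temperature,
and the ppm figures are illustrative assumptions rather than measurements.
128 ms is ntpd's default step threshold, taken here as one possible reading
of a "recoverable" offset.  The function names are made up for this example.

```python
STEP_THRESHOLD_S = 0.128  # ntpd's default step threshold, in seconds


def freewheel_offset(ppm, seconds):
    """Offset (s) accumulated after `seconds` with a constant error of `ppm`."""
    return ppm * 1e-6 * seconds


def max_holdover(ppm, limit_s=STEP_THRESHOLD_S):
    """Seconds of free-wheeling before the offset reaches `limit_s`."""
    return limit_s / (ppm * 1e-6)


if __name__ == "__main__":
    day = 86400
    for ppm in (0.5, 1.0, 5.0):  # assumed residual errors, not measured values
        print(f"{ppm:4.1f} ppm: {freewheel_offset(ppm, day) * 1e3:6.1f} ms/day, "
              f"holdover {max_holdover(ppm) / 3600:5.1f} h")
```

Under these assumptions a 1 ppm residual error costs about 86 ms per day,
and anything above roughly 1.5 ppm crosses 128 ms within the day.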

I would say that once the shift register runs dry, we should
halve the poll interval (if minpoll allows) for every empty shift
register we see.

That way you should have a scenario like:

time (s)  poll (s)     reach shift register
0         poll = 1024  shift=11111111
1024      poll = 1024  shift=11111110
2048      poll = 1024  shift=11111100
3072      poll = 1024  shift=11111000
4096      poll = 1024  shift=11110000
5120      poll = 1024  shift=11100000
6144      poll = 1024  shift=11000000
7168      poll = 1024  shift=10000000
8192      poll = 1024  shift=00000000, reduce poll, start timer 512 * 8
12288     poll = 512   shift=00000000, reduce poll, start timer 256 * 8
14336     poll = 256   shift=00000000, reduce poll, start timer 128 * 8
15360     poll = 128   shift=00000000, reduce poll, start timer 64 * 8
15872     poll = 64    shift=00000000, at minpoll, do nothing

That way we are back to a 64s poll interval after 4h24m, and that
sounds very compatible with typical XO performance.
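
The scenario above can be reproduced with a few lines of Python.  This is
only a sketch of the proposal, not ntpd code: the function name and its
parameters are invented for the example, and it assumes the peer goes
silent right after the poll at time 0.

```python
def backoff_schedule(maxpoll=1024, minpoll=64, register_bits=8):
    """Return (time_s, poll_s, reach) rows for a peer that goes silent at t=0."""
    mask = (1 << register_bits) - 1
    t, poll, reach = 0, maxpoll, mask   # all ones: the last 8 polls were answered
    rows = [(t, poll, reach)]

    # Phase 1: each unanswered poll shifts a zero into the register.
    while reach:
        t += poll
        reach = (reach << 1) & mask
        rows.append((t, poll, reach))

    # Phase 2: register is empty, so halve the poll interval and wait
    # one full register's worth of polls at the new interval.
    while poll > minpoll:
        poll //= 2
        t += poll * register_bits
        rows.append((t, poll, reach))
    return rows


if __name__ == "__main__":
    for t, poll, reach in backoff_schedule():
        print(f"{t:6d}  poll = {poll:<4d}  shift={reach:08b}")
```

With the defaults the last row lands at 15872 seconds, which is the 4h24m
quoted above.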

In general the majority of NTPD synchronized machines suffer from
diurnal wobble, so even 12 hours wouldn't be unreasonable.

>The clock discipline algorithm is very good at estimating the optimum 
>time constant.

Actually it isn't.

It is far too eager to wander up to 1024 and due to the time delay
of the shift register it takes ages for it to find out it got too
far and it usually ends up stepping to get back in sync.

The worst case situation is actually incredibly common:  You wander
up to 1024, and temperature changes, so your offset grows.  The
shift register fills up with monotonically increasing offsets and
we get a systematic delay of [3...4] x 1024 seconds before the PLL
ever hears about the existence of the offset.

Iburst mode is certainly a big improvement but not very widely used
yet.

>You are invited to concoct 
>counterexamples, but I will believe them only if confirmed by actual 
>scenarios in vivo or better yet in simulation.

My first and primary beef is that we do not whine loudly when we
have lost reachability, no matter how long this has been going on.

Can't we at least agree that after being unreachable for N hours
we should syslog something rather severe ?

I'd propose 24 for N, but even 168 will improve on the current
situation where people have no inkling that their system has
wandered off into the sunset.
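
As a sketch of what such a check could look like (in Python for brevity,
not a patch against ntpd): the class and method names below are
hypothetical, timestamps are plain seconds supplied by the caller, and the
logger stands in for syslog.

```python
import logging


class ReachWatchdog:
    """Complain once when no server has been reachable for `limit_hours`."""

    def __init__(self, limit_hours=24, logger=None):
        self.limit_s = limit_hours * 3600
        self.log = logger or logging.getLogger("ntpd-sketch")
        self.last_reach = None   # time of the last successful poll
        self.warned = False      # avoid repeating the message every tick

    def reached(self, now):
        """Record a successful poll and re-arm the warning."""
        self.last_reach = now
        self.warned = False

    def tick(self, now):
        """Call periodically; returns True if a warning was emitted."""
        if self.last_reach is None or self.warned:
            return False
        silent = now - self.last_reach
        if silent >= self.limit_s:
            self.log.critical("no server reachable for %.1f hours", silent / 3600)
            self.warned = True
            return True
        return False
```

Logging once per outage rather than on every tick keeps the message severe
without flooding the log; whether N should be 24 or 168 is just the
`limit_hours` argument.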

>The local clock is a terrible idea, unless for the only purpose to 
>wrangle a herd to a common timescale in response to a loss of outside 
>synchronization.

Agreed.  I believe some OS bogusly ships with a stratum 11 localclock
and whoever decided that should be forced to polish the hands of
Big Ben until he or it wears out.

But in this case, localclock only obscures the problem, it is not
the basic problem.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

