[ntp:hackers] Does ntpd need to whine more ?

todd.glassey at att.net todd.glassey at att.net
Tue Oct 4 10:15:23 PDT 2005


Dave -
Ahh yes - more Physics vs. Auditing and Evidence... 

 -------------- Original message ----------------------
From: "David L. Mills" <mills at udel.edu>
> P-H,
> 
> About islands, you and I are on different ones in different oceans.
> 
> I have had and do have configurations where the interval between updates 
> is over a day to several days. Once the frequency has stabilized, I 
> expect it to average out over the day no worse than 1 PPM, which results 
> in a residual error no worse than 84 ms over the day, almost always much 
> less than that. This is under normal heating and cooling home/office 
> conditions. That's what I designed the poll-adjust routine for.

Which sets a performance granularity for how NTP practices time management. Cool - we need to document this in the NTP performance and practice specification...
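
As a sanity check on that figure, here is a minimal sketch of the arithmetic, assuming a constant 1 PPM frequency error held for a full day with no updates. The function name is mine, for illustration only. It comes out at roughly 86 ms, the same ballpark as the number quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400 s


def residual_error_ms(freq_error_ppm, interval_s=SECONDS_PER_DAY):
    """Time error in ms accumulated by a constant frequency error.

    Assumes the error is constant over the whole interval, which is the
    pessimistic case; a real oscillator wanders around its mean.
    """
    return freq_error_ppm * 1e-6 * interval_s * 1e3


if __name__ == "__main__":
    # 1 PPM held for one day
    print(f"{residual_error_ms(1.0):.1f} ms")
```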

> 
> We have in the department and on my wires about two dozen NTP servers 
> with a total of probably several thousand clients. So far as I know from 
> log sampling the only time any of them have taken a step is after 
> reboot. The experiments I have done with WWV and ACTS involved several 
> different machines quite charitably considered junkbox at best. Only the 
> Alphae have anything other than a commodity oscillator. I have graphs of 
> free-run frequency versus time for quite a collection of junkbox 
> machines and they vary typically over a band less than one PPM over a 
> day, again under home/office conditions. Once continuously disciplined 
> by NTP, I expect the performance claimed above.

Good - 

> 
> I fully understand in the jungles of Borneo and the Arctic tundra such 
> assumptions are invalid. 

So is using either of these locations to play out the extremes of what a clock chip would be subjected to... BTW - I spent time in Antarctica during the geophysical year, and our computer room had both heating and cooling facilities as well as an air-dryer to pull frozen water out of the atmosphere, and it was NEVER down. In fact... that computer center ran better than any number of other critical 7x24 data centers I have managed.

Likewise, in Borneo, there are no computer rooms in the jungles of the island that I know of - so they aren't real possibilities either.

> To those operators I suggest reducing the 
> maxpoll to less than 1024 s. 

Those operators have temperature-controlled environments, so their machines operate as yours does.
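
For anyone who does want to follow the maxpoll suggestion quoted above: ntp.conf takes minpoll and maxpoll as power-of-two exponents, not seconds, so 1024 s is 10 and 512 s is 9. A small sketch of that conversion follows; the helper names and the host ntp.example.org are placeholders of mine, not anything from the distribution.

```python
def poll_exponent(interval_s):
    """Convert a poll interval in seconds to the log2 exponent ntp.conf expects."""
    if interval_s < 1 or interval_s & (interval_s - 1):
        raise ValueError("poll interval must be a power of two in seconds")
    return interval_s.bit_length() - 1


def server_line(host, minpoll_s, maxpoll_s):
    """Build an ntp.conf 'server' line (host name here is a placeholder)."""
    return (f"server {host} "
            f"minpoll {poll_exponent(minpoll_s)} "
            f"maxpoll {poll_exponent(maxpoll_s)}")


if __name__ == "__main__":
    print(server_line("ntp.example.org", 64, 512))
    # server ntp.example.org minpoll 6 maxpoll 9
```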

> As I said in my previous, a more responsive 
> behavior can easily be engineered by judicious choice of parameters, and 
> that can be done on a per-scenario basis. 

Which is the exact reason a better installer is needed - to more adequately configure the server/client system for its actual use model.

> I am reluctant to do this for 
> the general population, as it invites a vulnerability for network 
> traffic.

Uh, cool - so let me as a professional auditor tell you to do it anyway... and that will handle your reluctance right?

>  There may in fact be merit in dynamic adjustment of those 
> parameters, 

From an audit perspective perhaps - but if you can show a discipline that keeps the machine to within the NIST or USNO Standard Deviation, then whoa nelly... that's where we need to go.

> but I am on other missions just now. You will not like my 
> proposed approach using the simulator and crafted signal generators.

As an Auditor - if you can package the sim operations so that they can be audited to be a part of the digital proofing model - then OK... otherwise, you are right, they are of very little value in the field or in a court of law.
> 
>  From my experience you need to carefully separate normal versus 
> anomalistic behavior. Sometimes during experiments with frequent 
> restarts and miscellaneous error conditions a significant frequency 
> torque results. 

These don't tend to happen in production environments.

> I'm not surprised, as the initial conditions can be 
> quite chaotic. Once the daemon has run awhile and without disturbance, 
> performance resumes as above.
> 
> Apparently, my message on poll interval engineering has not been 
> received. I do not want to reduce the poll interval once the shift 
> register clears. In fact, I want to increase it to reduce network load. 
> Note very carefully the loop time constant is not increased and, once 
> reachability is restored, the poll interval resumes where it left off. 
> Therefore, it does no good with respect to frequency correction to 
> reduce the poll interval less than this - this simply oversamples with 
> little frequency effect.
> 
> You may argue, notwithstanding the polling issue, that the time constant 
> should be reduced after a long hiatus. That could be very expensive for 
> a modem service, for example. The right way to do this is to adjust the 
> parameters of the poll-adjust algorithm as previously described. I'd 
> rather have a dialog on time constant; the poll interval issue is 
> secondary to that.
> 
> Also, if the frequency has really been torqued, start the recovery in 
> state 3, where the discipline computes the frequency adjustment 
> directly. To evaluate that approach, disable the kernel, remove the 
> frequency file, introduce a serious frequency error with ntptime and 
> start the daemon. After 900 s the frequency should be set within 1 PPM, 
> possibly accompanied by a step.
> 
> Dave
> 
> Poul-Henning Kamp wrote:
> 
> >In message <43413DB6.308 at udel.edu>, "David L. Mills" writes:
> >
> >  
> >
> >>There is a fundamental misunderstanding here.
> >>    
> >>
> >
> >Agreed, but we may not agree what the misunderstanding is.
> >
> >  
> >
> >>There is a fundamental misunderstanding here. The clock discipline is in 
> >>fact a flywheel which is nudged at each poll update to correct the time 
> >>and update the frequency estimate. If you stop nudging it for awhile it 
> >>may accumulate error, but not much. How long should you wait before 
> >>declaring unsynchronized?
> >>    
> >>
> >
> >I don't think it is unreasonable to expect people to have a plain
> >XO (unless they tell NTPD otherwise) and therefore few systems
> >actually have a recoverable offset after one day on the island.
> >
> >And expecting to recapture with a poll of 1024 after free-wheeling
> >for a day is waaaay more optimistic than 25 cent XO's deserve.
> >
> >I would say that once the shift register runs dry, we should
> >reduce the poll rate (if minpoll allows) for every empty shift
> >register we see:
> >
> >That way you should have a scenario like:
> >
> >0	poll = 1024 shift=11111111
> >1024	poll = 1024 shift=11111110
> >2048	poll = 1024 shift=11111100
> >3072	poll = 1024 shift=11111000
> >4096	poll = 1024 shift=11110000
> >5120	poll = 1024 shift=11100000
> >6144	poll = 1024 shift=11000000
> >7168	poll = 1024 shift=10000000
> >8192	poll = 1024 shift=00000000, reduce poll, start timer 512 * 8
> >12288	poll = 512  shift=00000000, reduce poll, start timer 256 * 8
> >14336	poll = 256  shift=00000000, reduce poll, start timer 128 * 8
> >15360	poll = 128  shift=00000000, reduce poll, start timer 64 * 8
> >15872	poll = 64   shift=00000000  at minpoll, do nothing
> >
> >That way we are back to 64s poll rate after 4h24m and that sounds
> >very compatible with typical XO performance.
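
The schedule proposed above can be reproduced with a short simulation. This is only a sketch of the proposal as written in the table, not of anything ntpd actually does: it assumes an 8-bit reachability register, a poll that halves each time a full register's worth of polls at the new rate goes unanswered, and a floor at minpoll. The function name is hypothetical.

```python
def backoff_schedule(maxpoll=1024, minpoll=64, register_bits=8):
    """Return (time_s, poll_s, shift_register) rows for the proposed scheme."""
    rows = []
    t, poll = 0, maxpoll
    mask = (1 << register_bits) - 1
    reg = mask  # fully reachable at t = 0

    # Phase 1: source goes silent; every poll shifts a 0 into the register.
    while True:
        rows.append((t, poll, format(reg, f"0{register_bits}b")))
        if reg == 0:
            break
        reg = (reg << 1) & mask
        t += poll

    # Phase 2: register is empty; halve the poll after each further
    # empty register's worth of polls at the new, shorter interval.
    empty = "0" * register_bits
    while poll > minpoll:
        poll //= 2
        t += poll * register_bits
        rows.append((t, poll, empty))
    return rows


if __name__ == "__main__":
    for t, poll, reg in backoff_schedule():
        print(f"{t}\tpoll = {poll:<4} shift={reg}")
    hours, rem = divmod(backoff_schedule()[-1][0], 3600)
    print(f"at minpoll after {hours}h{rem // 60}m")
```

With the defaults this ends at 15872 s, which matches the 4h24m figure in the quoted text.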
> >
> >In general the majority of NTPD synchronized machines suffer from
> >diurnal wobble, so even 12 hours wouldn't be unreasonable.
> >
> >  
> >
> >>The clock discipline algorithm is very good at estimating the optimum 
> >>time constant.
> >>    
> >>
> >
> >Actually it isn't.
> >
> >It is far too eager to wander up to 1024 and due to the time delay
> >of the shift register it takes ages for it to find out it got too
> >far and it usually ends up stepping to get back in sync.
> >
> >The worst case situation is actually incredibly common:  You wander
> >up to 1024, and temperature changes, so your offset grows.  The
> >shift register fills up with monotonically increasing offsets and
> >we get a systematic delay of [3...4] x 1024 seconds before the PLL
> >ever hears about the existence of the offset.
> >
> >Iburst mode is certainly a big improvement but not very widely used
> >yet.
> >
> >  
> >
> >>You are invited to concoct 
> >>counterexamples, but I will believe them only if confirmed by actual 
> >>scenarios in vivo or better yet in simulation.
> >>    
> >>
> >
> >My first and primary beef is that we do not whine loudly when we
> >have lost reachability, no matter how long this has been going on.
> >
> >Can't we at least agree that after being unreachable for N hours
> >we should syslog something rather severe ?
> >
> >I'd propose 24 for N, but even 168 will improve on the current
> >situation where people have no inkling that their system has
> >wandered off into the sunset.
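
The rule being asked for here is simple enough to sketch. The following is an illustration in Python, not ntpd code: the function name, logger name, and message text are all made up, and a real daemon would go through syslog(3) rather than Python's logging module. It assumes the caller tracks the time of the last successful reach in seconds.

```python
import logging


def check_unreachable(last_reached_s, now_s, threshold_hours=24,
                      log=logging.getLogger("ntpd-watch")):
    """Log at CRITICAL once no source has been reachable for threshold_hours.

    Returns True if the warning was emitted, False otherwise.
    """
    silent_s = now_s - last_reached_s
    if silent_s < threshold_hours * 3600:
        return False
    log.critical("no reachable time source for %.1f hours", silent_s / 3600)
    return True


if __name__ == "__main__":
    # 30 hours of silence against the proposed N = 24
    check_unreachable(last_reached_s=0, now_s=30 * 3600)
```

Whether N is 24 or 168 is just the threshold_hours argument; the point of the proposal is that some threshold exists at all.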
> >
> >  
> >
> >>The local clock is a terrible idea, unless for the only purpose to 
> >>wrangle a herd to a common timescale in response to a loss of outside 
> >>synchronization.
> >>    
> >>
> >
> >Agreed.  I believe some OS bogusly ships with a stratum 11 localclock
> >and whoever decided that should be forced to polish the hands of
> >Big Ben until he or it wears out.
> >
> >But in this case, localclock only obscures the problem, it is not
> >the basic problem.
> >
> >  
> >
> 