[ntp:questions] Re: Frequent time reset messages
bob.robison at swri.org
Sun Dec 4 02:13:01 UTC 2005
I've tried setting the kernel HZ down to 100, but it didn't help, and it
seems to have caused other timing problems, though I may revisit that at
some point. I do believe the problem is lost interrupts, but I don't know
how to be sure without hacking the firewire driver to measure how long
it spends in its interrupt routine. And even if I found out, I'm not sure
what I could do about it.
At the moment I'm trying to get ntpd to simply step/adjust more often.
I've set minpoll and maxpoll to 4 (a 16-second poll interval) and I
understand that the default step threshold is 128 ms. However, I still see
offsets that are larger than that, and the 'steps' only occur every 20 or
30 minutes. How can I make them happen more often?
Other comments below:
On Fri, 02 Dec 2005 15:13:31 -0600
hmurray at suespammers.org (Hal Murray) wrote:
> In article <20051201161213.5eb0e8fd at caspian>,
> bob.robison at swri.org (Bob Robison) writes:
> >I'm running a moderate number (around 50) dual-opterons that are
> >diskless booting a Linux 2.6.12 smp kernel and trying to synch with a
> >Symmetricom XLI-GPS stratum-1 NTP server on an isolated network.
> >The problem I have is that when I run "ntpq -c peers" on a number of
> >these machines to check the status of the ntp synchronization, I see
> >offsets ranging over almost 1000 msecs. If I grep through
> >the /var/log/messages file, I see that there are often messages
> >around every 20 minutes like this:
> >Dec 1 20:30:28 (none) ntpd: time reset 0.613771 s
> >Dec 1 20:30:28 (none) ntpd: synchronisation lost
> >Dec 1 20:50:45 (none) ntpd: time reset 0.931388 s
> >Dec 1 20:50:45 (none) ntpd: synchronisation lost
> >Dec 1 21:19:23 (none) ntpd: time reset 0.451491 s
> >Dec 1 21:19:23 (none) ntpd: synchronisation lost
> >Dec 1 21:36:24 (none) ntpd: time reset 0.391510 s
> >Dec 1 21:36:24 (none) ntpd: synchronisation lost
> Somebody else suggested lost interrupts. That would be pretty
> high on my list.
> What happens if you let one of the systems just sit there without
> doing anything? If it keeps good time your problem is
> probably caused by your normal workload.
I need to try this... haven't done that yet because of coordination
with other things going on in the system. Will move this up on the
--->>>> Tried this before sending the email: the clock still drifts off, even with nothing happening on the system. More confused now.
> > Probably the main issue is the CPU and I/O loading on these opteron
> > machines. They are each handling streaming data from a firewire
> > card (IEEE-1394a) and the CPUs stay fairly busy handling that data
> > -- though they are not pegged at 100% or anything.
> The issue is not so much if you are using all the CPU, but if the
> clock update interrupt routine is being locked out long enough to
> miss an interrupt because the second one comes in before the first
> one has been processed.
Yes, I understand. However, I've seen references to 'too many lost
ticks' error messages appearing in the kernel logs, but I never see them
on these machines, and I'm not sure why not.
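
A rough way to put numbers on the lost-tick theory is to take the reset
lines quoted earlier in this message and work out how fast the clock must
have been drifting between resets. The Python sketch below does that. It
assumes the clock was correct right after each reset, that the whole offset
comes from lost timer ticks, and that the kernel runs at HZ=1000. The HZ
value, the year, and the helper names are assumptions for illustration,
not something read from these machines.

```python
import re
from datetime import datetime

# Reset lines copied from the syslog excerpt quoted earlier in this message.
LOG = """\
Dec 1 20:30:28 (none) ntpd: time reset 0.613771 s
Dec 1 20:50:45 (none) ntpd: time reset 0.931388 s
Dec 1 21:19:23 (none) ntpd: time reset 0.451491 s
Dec 1 21:36:24 (none) ntpd: time reset 0.391510 s
"""

HZ = 1000  # ASSUMPTION: timer interrupts per second; not confirmed for these hosts

RESET = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*ntpd: time reset ([-+]?\d+\.\d+) s"
)


def parse_resets(text, year=2005):
    """Return (timestamp, offset_seconds) for each 'time reset' line.

    Syslog lines carry no year, so one is supplied (assumed 2005 here).
    Month names are parsed with %b, which expects an English/C locale.
    """
    resets = []
    for line in text.splitlines():
        m = RESET.match(line)
        if m:
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            resets.append((when, float(m.group(2))))
    return resets


def drift_report(resets, hz=HZ):
    """For each pair of consecutive resets, return
    (interval_s, offset_s, implied_drift_ppm, implied_lost_ticks)."""
    rows = []
    for (t0, _), (t1, offset) in zip(resets, resets[1:]):
        interval = (t1 - t0).total_seconds()
        ppm = abs(offset) / interval * 1e6   # offset accumulated per second
        ticks = round(abs(offset) * hz)      # each lost tick costs 1/hz seconds
        rows.append((interval, offset, ppm, ticks))
    return rows


if __name__ == "__main__":
    for interval, offset, ppm, ticks in drift_report(parse_resets(LOG)):
        print(f"{interval:6.0f} s  {offset:.6f} s  {ppm:6.0f} ppm  ~{ticks} ticks")
```

Under those assumptions the first interval (1217 s, 0.931388 s) works out
to roughly 765 ppm, which is more than the 500 ppm ntpd will correct by
slewing, and to several hundred ticks lost between resets.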
> If I was trying to understand this, I'd consider patching the
> firewire interrupt routine to turn on a printer port bit at the
> start and turn it off at the end, and then put a scope on that
> pin to see how long it was on. Most modern (digital) scopes
> have a trigger on X longer than Y mode that will show you the
> bad cases.
> Or do it all in software by grabbing the cycle counter and making
> a histogram.
I may have to do this... but holding off if I can.
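
The software-only variant can be prototyped in user space before touching
the driver. The sketch below is not the kernel patch described above: it
is a Python stand-in that reads a monotonic counter (time.perf_counter_ns)
in a tight loop and histograms the gaps between successive reads in
power-of-two buckets, so that rare long stalls stand out. The function
names and the bucket scheme are illustrative assumptions; measuring the
firewire interrupt routine itself would mean reading the cycle counter
inside the driver.

```python
import time
from collections import Counter


def bucket(delta_ns):
    """Return the smallest power of two strictly greater than delta_ns.

    Used as the exclusive upper bound of the histogram bucket,
    e.g. 1000 ns falls in the bucket labelled 1024.
    """
    return 1 << max(delta_ns, 0).bit_length()


def histogram(deltas_ns):
    """Count how many deltas fall into each power-of-two bucket."""
    return Counter(bucket(d) for d in deltas_ns)


def sample_gaps(n=200_000):
    """Read the monotonic counter n times and return the gaps in ns."""
    gaps = []
    prev = time.perf_counter_ns()
    for _ in range(n):
        now = time.perf_counter_ns()
        gaps.append(now - prev)
        prev = now
    return gaps


if __name__ == "__main__":
    hist = histogram(sample_gaps())
    for upper in sorted(hist):
        print(f"< {upper:>12} ns: {hist[upper]}")
```

Long gaps seen this way include scheduler preemption as well as interrupt
handling, so they only give an upper bound on how long the loop was locked
out; the buckets at the far right of the histogram are the cases worth
chasing.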