[ntp:questions] Re: Frequent time reset messages

Bob Robison bob.robison at swri.org
Sun Dec 4 02:13:01 UTC 2005


I've tried lowering the kernel HZ to 100, but it didn't help, and it
seems to have caused other timing problems; I may revisit that at some
point.  I do believe the problem is lost interrupts, but I don't know
how to be sure without hacking the firewire driver to measure how long
it spends in its interrupt routine.  And even if I found out, I'm not
sure what I could do about it.

At the moment I'm trying to get ntpd to simply step/adjust more often.
I've set minpoll and maxpoll to 4 (a 16-second poll interval), and I
understand that the default step threshold is 128 ms.  However, I still
see offsets larger than that, and the steps only occur every 20 or 30
minutes.  How can I make them happen more often?
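
A minimal sketch of the arithmetic behind those two settings, assuming
only what is stated above: minpoll/maxpoll are powers of two in seconds,
and the default step threshold is 128 ms.  The function names are
made up for illustration; the sample offsets are the "time reset"
values from the log excerpt quoted further down.

```python
# Poll exponents are log2 of the poll interval in seconds.
STEP_THRESHOLD_S = 0.128  # default step threshold, 128 ms


def poll_interval_seconds(exponent: int) -> int:
    """Convert a minpoll/maxpoll exponent into a poll interval in seconds."""
    return 2 ** exponent


def exceeds_step_threshold(offset_s: float,
                           threshold_s: float = STEP_THRESHOLD_S) -> bool:
    """True if the absolute offset is larger than the step threshold."""
    return abs(offset_s) > threshold_s


if __name__ == "__main__":
    print("poll interval for exponent 4:", poll_interval_seconds(4), "s")
    # Offsets taken from the quoted "time reset" log lines.
    for offset in (0.613771, 0.931388, 0.451491, 0.391510):
        print(offset, "exceeds threshold:", exceeds_step_threshold(offset))
```

Every one of the logged resets is several times the 128 ms threshold,
which matches the observation that offsets grow well past it before a
step happens.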

Other comments below:

On Fri, 02 Dec 2005 15:13:31 -0600
hmurray at suespammers.org (Hal Murray) wrote:

> In article <20051201161213.5eb0e8fd at caspian>,
>  bob.robison at swri.org (Bob Robison) writes:
> >I'm running a moderate number (around 50) dual-opterons that are
> >diskless booting a Linux 2.6.12 smp kernel and trying to synch with a
> >Symmetricom XLI-GPS stratum-1 NTP server on an isolated network.
> >
> >The problem I have is that when I run "ntpq -c peers" on a number of
> >these machines to check the status of the ntp synchronization, I see
> >offsets ranging over almost 1000 msecs.  If I grep through
> >the /var/log/messages file, I see that there are often messages
> >around every 20 minutes like this:
> >
> >Dec  1 20:30:28 (none) ntpd[27203]: time reset 0.613771 s
> >Dec  1 20:30:28 (none) ntpd[27203]: synchronisation lost
> >Dec  1 20:50:45 (none) ntpd[27203]: time reset 0.931388 s
> >Dec  1 20:50:45 (none) ntpd[27203]: synchronisation lost
> >Dec  1 21:19:23 (none) ntpd[27203]: time reset 0.451491 s
> >Dec  1 21:19:23 (none) ntpd[27203]: synchronisation lost
> >Dec  1 21:36:24 (none) ntpd[27203]: time reset 0.391510 s
> >Dec  1 21:36:24 (none) ntpd[27203]: synchronisation lost
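
The log excerpt above can be turned into numbers with a short script.
This is only a rough sketch: it assumes the syslog format shown above,
assumes the year is 2005 (syslog lines carry no year), and treats each
step as if it were the error accumulated since the previous step, which
ignores whatever frequency correction ntpd was applying in between.
The names parse_resets and drift_between_resets are hypothetical.

```python
import re
from datetime import datetime

LOG = """\
Dec  1 20:30:28 (none) ntpd[27203]: time reset 0.613771 s
Dec  1 20:30:28 (none) ntpd[27203]: synchronisation lost
Dec  1 20:50:45 (none) ntpd[27203]: time reset 0.931388 s
Dec  1 20:50:45 (none) ntpd[27203]: synchronisation lost
Dec  1 21:19:23 (none) ntpd[27203]: time reset 0.451491 s
Dec  1 21:19:23 (none) ntpd[27203]: synchronisation lost
Dec  1 21:36:24 (none) ntpd[27203]: time reset 0.391510 s
Dec  1 21:36:24 (none) ntpd[27203]: synchronisation lost
"""

# Timestamp, then anything up to "ntpd[pid]: time reset <seconds> s".
RESET_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*ntpd\[\d+\]: "
    r"time reset ([-+]?\d+\.\d+) s"
)


def parse_resets(text, year=2005):
    """Return a list of (timestamp, step_seconds) for each 'time reset' line."""
    resets = []
    for line in text.splitlines():
        m = RESET_RE.match(line)
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}",
                                      "%Y %b %d %H:%M:%S")
            resets.append((stamp, float(m.group(2))))
    return resets


def drift_between_resets(resets):
    """Return (interval_s, step_s, implied_rate_ppm) for consecutive resets."""
    rows = []
    for (t0, _), (t1, step) in zip(resets, resets[1:]):
        interval = (t1 - t0).total_seconds()
        rows.append((interval, step, step / interval * 1e6))
    return rows


if __name__ == "__main__":
    for interval, step, ppm in drift_between_resets(parse_resets(LOG)):
        print(f"{interval:6.0f} s between resets, "
              f"step {step:.6f} s, about {ppm:.0f} ppm")
```

On these four resets the intervals come out at roughly 17 to 29
minutes, with implied rates of a few hundred parts per million.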
> 
> Somebody else suggested lost interrupts.  That would be pretty
> high on my list.
> 
> What happens if you let one of the systems just sit there without
> doing anything?  If it keeps good time your problem is
> probably caused by your normal workload.

I need to try this.  I haven't done it yet because of coordination
with other things going on in the system.  Will move this up on the
priority list.

Update: I tried this before sending this email.  The clock still drifts
off, even with nothing happening on the system.  More confused now.
> 
> 
> > Probably the main issue is the CPU and I/O loading on these opteron
> > machines.  They are each handling streaming data from a firewire
> > card (IEEE-1394a) and the CPUs stay fairly busy handling that data
> > -- though they are not pegged at 100% or anything.
> 
> The issue is not so much if you are using all the CPU, but if the
> clock update interrupt routine is being locked out long enough to
> miss an interrupt because the second one comes in before the first
> one has been processed.

Yes, I understand.  However, I've seen references to 'too many lost
ticks' error messages in the kernel logs, but I never see these on my
systems, and I'm not sure why not.
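
To put the lost-interrupt theory in numbers: if every missed timer
interrupt costs the clock one tick of 1/HZ seconds, a given offset
corresponds to a tick count.  The sketch below is illustrative only.
It assumes the original kernel ran at HZ=1000 (the post only says HZ
was lowered to 100), and the function names are made up.

```python
def lost_ticks(offset_s: float, hz: int) -> int:
    """Number of missed timer ticks that would account for offset_s."""
    return round(offset_s * hz)


def lost_fraction(offset_s: float, interval_s: float) -> float:
    """Fraction of clock time that went missing over interval_s."""
    return offset_s / interval_s


if __name__ == "__main__":
    # 0.613771 s is the first "time reset" value in the quoted log.
    for hz in (1000, 100):  # HZ=1000 is an assumption, HZ=100 is from the post
        print(f"HZ={hz}: {lost_ticks(0.613771, hz)} ticks")
    # 0.931388 s accumulated over the 1217 s between the first two resets.
    frac = lost_fraction(0.931388, 1217.0)
    print(f"about 1 tick in {round(1 / frac)}")
```

Under those assumptions only a small fraction of ticks would need to go
missing to produce the logged offsets.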
> 
> If I were trying to understand this, I'd consider patching the
> firewire interrupt routine to turn on a printer port bit at the
> start and turn it off at the end, and then put a scope on that
> pin to see how long it was on.  Most modern (digital) scopes
> have a trigger on X longer than Y mode that will show you the
> bad cases.
> 
> Or do it all in software by grabbing the cycle counter and making
> a histogram.

I may have to do this, but I'm holding off if I can.
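
For reference, the software variant (timestamp at entry and exit, then
histogram the differences) can be sketched in user space.  This is not
driver code: in the firewire interrupt routine it would be C reading
the CPU cycle counter, whereas this stand-in uses Python's
time.perf_counter_ns() and nanoseconds.  The power-of-two bucketing and
all names here are assumptions made for illustration.

```python
import time
from collections import Counter


def bucket_for(duration_ns: int) -> int:
    """Power-of-two bucket index: floor(log2(duration_ns)), minimum 0."""
    return max(duration_ns, 1).bit_length() - 1


def histogram(durations_ns):
    """Count how many durations fall into each power-of-two bucket."""
    return Counter(bucket_for(d) for d in durations_ns)


def timed(fn, hist):
    """Run fn() and record its duration in hist, even if fn raises."""
    start = time.perf_counter_ns()
    try:
        return fn()
    finally:
        hist[bucket_for(time.perf_counter_ns() - start)] += 1


def report(hist):
    """Print one line per non-empty bucket, shortest durations first."""
    for bucket in sorted(hist):
        low, high = 2 ** bucket, 2 ** (bucket + 1) - 1
        print(f"{low:>12} .. {high:>12} ns: {hist[bucket]}")


if __name__ == "__main__":
    hist = Counter()
    for _ in range(1000):
        timed(lambda: sum(range(1000)), hist)  # stand-in for the handler body
    report(hist)
```

With HZ=1000 a tick is 1,000,000 ns, so the buckets of interest would
be the ones at and above index 19 (524,288 ns and up).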

bob




