[ntp:questions] Re: Frequent time reset messages

Hal Murray hmurray at suespammers.org
Fri Dec 2 21:13:31 UTC 2005


In article <20051201161213.5eb0e8fd at caspian>,
 bob.robison at swri.org (Bob Robison) writes:
>I'm running a moderate number (around 50) dual-opterons that are
>diskless booting a Linux 2.6.12 smp kernel and trying to synch with a
>Symmetricon XLI-GPS stratum-1 NTP server on an isolated network.
>
>The problem I have is that when I run "ntpq -c peers" on a number of
>these machines to check the status of the ntp synchronization, I see
>offsets ranging over almost 1000 msecs.  If I grep through the /var/log/
>messages file, I see that there are often messages around every 20
>minutes like this:
>
>Dec  1 20:30:28 (none) ntpd[27203]: time reset 0.613771 s
>Dec  1 20:30:28 (none) ntpd[27203]: synchronisation lost
>Dec  1 20:50:45 (none) ntpd[27203]: time reset 0.931388 s
>Dec  1 20:50:45 (none) ntpd[27203]: synchronisation lost
>Dec  1 21:19:23 (none) ntpd[27203]: time reset 0.451491 s
>Dec  1 21:19:23 (none) ntpd[27203]: synchronisation lost
>Dec  1 21:36:24 (none) ntpd[27203]: time reset 0.391510 s
>Dec  1 21:36:24 (none) ntpd[27203]: synchronisation lost

Somebody else suggested lost interrupts.  That would be pretty
high on my list.

What happens if you let one of the systems just sit idle, without
its normal workload?  If it keeps good time, your problem is
probably caused by that workload.


> Probably the main issue is the CPU and I/O loading on these opteron
> machines.  They are each handling streaming data from a firewire card
> (IEEE-1394a) and the CPUs stay fairly busy handling that data -- though
> they are not pegged at 100% or anything.

The issue is not so much whether you are using all the CPU, but
whether the clock update interrupt routine is being locked out
long enough to miss an interrupt.  If a second timer interrupt
comes in before the first one has been processed, that tick is
lost and the clock falls behind.
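
To put rough numbers on the lost-interrupt theory, the step sizes
in the log quoted earlier can be converted to timer ticks.  This
is a back-of-the-envelope sketch in Python, not something from the
original thread.  It assumes a timer interrupt rate of HZ = 1000
(check your own kernel config), that each lost interrupt costs
exactly one tick, and that the whole step comes from lost ticks;
ntpd's own frequency corrections are ignored.  The names
parse_resets and lost_tick_estimate are made up for illustration.

```python
import re
from datetime import datetime

HZ = 1000  # ASSUMPTION: timer ticks per second; verify against your kernel

# "time reset" lines copied from the log quoted in the post.
LOG = """\
Dec  1 20:30:28 (none) ntpd[27203]: time reset 0.613771 s
Dec  1 20:50:45 (none) ntpd[27203]: time reset 0.931388 s
Dec  1 21:19:23 (none) ntpd[27203]: time reset 0.451491 s
Dec  1 21:36:24 (none) ntpd[27203]: time reset 0.391510 s
"""

PATTERN = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*time reset\s+([-+]?\d+\.\d+) s"
)


def parse_resets(text, year=2005):
    """Return [(timestamp, step_seconds)] for each 'time reset' line."""
    resets = []
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            # syslog timestamps carry no year and pad the day with spaces
            stamp = " ".join(m.group(1).split())
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            resets.append((when, float(m.group(2))))
    return resets


def lost_tick_estimate(resets, hz=HZ):
    """For each pair of consecutive resets, return
    (interval_seconds, ticks_implied_by_step, fraction_of_ticks_lost)."""
    rows = []
    for (t0, _), (t1, step) in zip(resets, resets[1:]):
        interval = (t1 - t0).total_seconds()
        ticks = round(abs(step) * hz)
        rows.append((interval, ticks, abs(step) / interval))
    return rows


if __name__ == "__main__":
    for interval, ticks, frac in lost_tick_estimate(parse_resets(LOG)):
        print(f"{interval:6.0f} s between resets: ~{ticks} ticks, "
              f"{frac * 100:.3f}% of all ticks")
```

Under those assumptions each step corresponds to a few hundred
ticks spread over 17 to 29 minutes, which is well under 0.1% of
all ticks.  So the interrupt routine would only need to be locked
out too long once in a while to produce steps of this size.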

If I were trying to understand this, I'd consider patching the
firewire interrupt routine to turn on a printer port bit at the
start and turn it off at the end, and then put a scope on that
pin to see how long it was on.  Most modern (digital) scopes
have a "trigger on X longer than Y" mode that will show you the
bad cases.

Or do it all in software: grab the cycle counter at the start and
end of the routine and make a histogram of the differences.
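
The bookkeeping for that histogram is small.  The following is
only a user-space Python sketch of the idea, with
time.perf_counter_ns() standing in for the CPU cycle counter; in
the real interrupt routine this would be kernel C reading the
cycle counter directly.  The power-of-two bucket layout and the
names bucket, DurationHistogram and timed are my own choices for
illustration, not anything from an existing tool.

```python
import time
from collections import Counter


def bucket(duration_ns):
    """Power-of-two bucket index: bucket b covers [2**b, 2**(b+1)) ns.
    Durations below 1 ns are folded into bucket 0."""
    return max(duration_ns, 1).bit_length() - 1


class DurationHistogram:
    """Counts how many measured durations fall into each bucket."""

    def __init__(self):
        self.counts = Counter()

    def record(self, start_ns, end_ns):
        self.counts[bucket(end_ns - start_ns)] += 1

    def report(self):
        """One text line per non-empty bucket, shortest durations first."""
        return [
            f"{2 ** b:>12d} .. {2 ** (b + 1) - 1:>12d} ns: {self.counts[b]}"
            for b in sorted(self.counts)
        ]


def timed(hist, func, *args):
    """Run func(*args), recording its duration in hist (stand-in for
    reading the counter at the start and end of the handler)."""
    start = time.perf_counter_ns()
    result = func(*args)
    hist.record(start, time.perf_counter_ns())
    return result


if __name__ == "__main__":
    hist = DurationHistogram()
    for _ in range(10000):
        timed(hist, sum, range(100))  # placeholder for the routine under test
    print("\n".join(hist.report()))
```

The interesting part of the output is the tail: any bucket at or
above the timer tick period (about 1 ms if HZ is 1000, which is an
assumption about your kernel) is a candidate for a lost clock
interrupt, provided interrupts stay disabled for that whole time.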

-- 
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.



