[ntp:questions] high precision tracking: trying to understand sudden jumps

starlight at binnacle.cx
Sun Mar 30 17:09:43 UTC 2008


Hello,

I'm trying to configure a small network for high precision time. 
Recently acquired an Endrun CDMA time server that runs like 
a dream, tracking CDMA time to about +/- 5 microseconds.

The clients are a rag-tag assembly of diverse systems including 
a CentOS 4.5 Linux i686, Linux x86_64, Sun Ultra 10, Sun Ultra 80, 
IBM RS/6000 44p, Windows 2003 X64, and a Windows XP laptop.

All are configured to prefer the Endrun clock and poll it on a 
16 second interval.  All are attached to a single SMC gigabit 
Ethernet switch, with only the Endrun and two Sun systems running 
at a lower speed of 100 Mbps.  Network traffic and system loads 
are close to zero.
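
For reference, a 16 second interval corresponds to a poll exponent 
of 4 in 'ntp.conf', since ntpd expresses poll intervals as powers 
of two.  The sketch below builds the matching 'server' line.  The 
address 192.0.2.10 is a placeholder from the documentation range, 
not the real Endrun address, and the helper name 'server_line' is 
made up for illustration.

```python
def server_line(address: str, poll_seconds: int, prefer: bool = True) -> str:
    """Build an ntp.conf 'server' line that pins the poll interval.

    ntpd takes minpoll/maxpoll as exponents of two, so 16 s becomes 4.
    """
    if poll_seconds <= 0 or poll_seconds & (poll_seconds - 1):
        raise ValueError("poll interval must be a power of two seconds")
    exponent = poll_seconds.bit_length() - 1
    parts = ["server", address]
    if prefer:
        parts.append("prefer")
    # Setting minpoll and maxpoll to the same value fixes the interval.
    parts += ["minpoll", str(exponent), "maxpoll", str(exponent)]
    return " ".join(parts)


# 192.0.2.10 is a placeholder address, not the actual time server.
print(server_line("192.0.2.10", 16))
# prints: server 192.0.2.10 prefer minpoll 4 maxpoll 4
```
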

All systems are running 'ntpd' 4.2.4p4.  Compiled NTP native 
64-bit for the Windows X64 system.  [A #ifdef tweak to 
'intptr_t' and 'uintptr_t' is required, will provide patch if 
desired].

It generally is working well, with the systems tracking anywhere 
from +/- 100 microseconds to +/- 500 microseconds most of the 
time.

However, once or twice a day, every system experiences a 
random, uncorrelated time shift of one to several 
milliseconds.  Had an issue where a UPS voltage correction shift 
and a cheap power supply on the Windows X64 box appeared to be a 
problem, but that was fixed by configuring the UPS to consider 
110V nominal instead of 120V.

Does anyone have any ideas about what could be causing these 
random time jumps and what might be done to eliminate them?
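
One way to pin down when the jumps happen is to scan each host's 
'loopstats' file for consecutive offset samples that differ by a 
millisecond or more.  The sketch below assumes the standard 
loopstats layout (MJD, seconds past midnight, offset in seconds, 
frequency in PPM, ...).  The sample lines are synthetic and only 
illustrate the format; the 1 ms threshold and the function names 
are assumptions, not anything taken from ntpd.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_loopstats(lines: Iterable[str]) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, offset) pairs from loopstats lines.

    The timestamp is MJD * 86400 + seconds past midnight, which is
    enough to line up events between hosts.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        yield int(fields[0]) * 86400.0 + float(fields[1]), float(fields[2])


def find_jumps(samples: Iterable[Tuple[float, float]],
               threshold: float = 0.001) -> List[Tuple[float, float]]:
    """Return (timestamp, delta) wherever the offset moves by >= threshold."""
    jumps = []
    previous = None
    for timestamp, offset in samples:
        if previous is not None and abs(offset - previous) >= threshold:
            jumps.append((timestamp, offset - previous))
        previous = offset
    return jumps


# Synthetic sample data, for illustration only.
sample = [
    "54555 61000.000 0.000120000 13.778 0.000035 0.0133 4",
    "54555 61016.000 0.000135000 13.778 0.000035 0.0133 4",
    "54555 61032.000 0.002400000 13.790 0.000800 0.0150 4",
    "54555 61048.000 0.002100000 13.800 0.000700 0.0150 4",
]

for timestamp, delta in find_jumps(parse_loopstats(sample)):
    print(f"jump of {delta * 1e3:+.3f} ms at t={timestamp:.0f}")
```

Running this over the loopstats from every client and comparing 
the reported timestamps would show whether the jumps really are 
uncorrelated between hosts or cluster around the same moments.
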

Something I'm planning to try is to make sure that 'mlock' is 
configured in the daemons--presently 'autoconf' has left it 
disabled for some reason.  However, I don't believe page 
faults are the culprit.  All the daemons are running at 
the highest real-time priority in the respective systems.
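
On the Linux clients, whether a running daemon actually has its 
pages locked can be checked from the 'VmLck' field of 
/proc/<pid>/status.  The sketch below is Linux-only and does not 
apply to the Solaris, AIX, or Windows hosts; the helper names and 
the sample status text are made up for illustration.

```python
from typing import Optional


def locked_kb(status_text: str) -> Optional[int]:
    """Return the VmLck value in kB from /proc/<pid>/status text.

    Returns None when the field is absent.
    """
    for line in status_text.splitlines():
        if line.startswith("VmLck:"):
            return int(line.split()[1])
    return None


def read_status(pid: int) -> str:
    """Read /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()


# Illustrative status text; on a real host use read_status(<ntpd pid>).
sample_status = (
    "Name:\tntpd\n"
    "VmPeak:\t    4520 kB\n"
    "VmLck:\t       0 kB\n"
    "VmRSS:\t    1280 kB\n"
)

kb = locked_kb(sample_status)
if kb:
    print(f"{kb} kB locked in memory")
else:
    print("no locked pages; memory locking appears to be disabled")
```

A value of zero for the ntpd process would be consistent with 
memory locking having been left out of the build.
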

The above configuration is a controlled lab setup.  The next 
target is a stack of eight Dell 1950 servers in a production 
data center running Windows 2003 R2 and slaved to a newer Endrun 
time server.  Don't have useful data from these systems yet 
because the network jitter is outrageous.  Working with the 
network admin to hopefully have the NTP traffic to and from the 
Endrun clock bypass layer 3 switch/router rule checking.  I 
suspect their large, complex router ACL rulesets are the cause 
of the jitter.

Attached are fairly representative graphs of the offset and 
frequency for two of the lab servers.

Thanks


P.S. Resent without graphs as the list mailer says
they're not allowed.  Happy to send them or the raw
'loopstats' to anyone interested.



