[ntp:questions] high precision tracking: trying to understand sudden jumps

starlight at binnacle.cx starlight at binnacle.cx
Sun Mar 30 20:22:07 UTC 2008


Here are URLs for those two sample graphs:

http://binnacle.cx/file/ntp_hickups_linux.gif
http://binnacle.cx/file/ntp_hickups_win.gif

David Woolley wrote:
>
>> The clients are a rag-tag assembly of diverse systems including 
>> a Centos 4.5 Linux i686, Linux x86_64, Sun Ultra 10, Sun Ultra 80, 
>> IBM RS/6000 44p, Windows 2003 X64, and a Windows XP laptop.
>
>How are you interpolating the 16ms ticks on the Windows system?
>How are you disabling power management on the lap top?

The generic version of 'ntpd' has some sophisticated code that 
handles interpolation.  See the source.  Power management is 
disabled on the laptop using the standard control panel option.  
Don't really care that much about this machine anyway.

>> It generally is working well, with the systems tracking anywhere 
>> from +/- 100 microseconds to +/- 500 microseconds most of the 
>> time.
>
>How are you measuring the difference from true time?  In principle, if 
>ntpd can measure it, it will correct it.

Using the 'ntpd' 'loopstats' file.  It does correct it; check out the graphs.
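
For anyone who wants to pick the jumps out of a 'loopstats' file rather than eyeball a graph, here is a rough Python sketch.  It assumes the standard 'loopstats' layout (MJD, seconds past midnight, offset in seconds, frequency in PPM, jitter, wander, time constant).  The sample lines, the function names and the 1 ms threshold are made up for illustration and are not taken from the graphs above.

```python
# Sketch: flag sudden steps in the ntpd 'loopstats' offset column.
# Assumed field order: MJD, seconds past midnight, offset (s), ...
# SAMPLE is invented data, not real measurements.

SAMPLE = """\
54555 72000.000 0.000120000 13.778 0.000050000 0.0130 6
54555 72064.000 0.000095000 13.779 0.000048000 0.0130 6
54555 72128.000 -0.002400000 13.779 0.000900000 0.0140 6
54555 72192.000 -0.001900000 13.781 0.000700000 0.0140 6
"""


def parse_loopstats(text):
    """Return a list of (mjd, seconds_past_midnight, offset_s) tuples."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        samples.append((int(fields[0]), float(fields[1]), float(fields[2])))
    return samples


def find_jumps(samples, threshold_s=0.001):
    """Return (mjd, seconds, step_s) wherever the offset moves by at
    least threshold_s between two consecutive samples."""
    jumps = []
    for prev, cur in zip(samples, samples[1:]):
        step = cur[2] - prev[2]
        if abs(step) >= threshold_s:
            jumps.append((cur[0], cur[1], step))
    return jumps


if __name__ == "__main__":
    for mjd, secs, step in find_jumps(parse_loopstats(SAMPLE)):
        print(f"MJD {mjd} {secs:9.3f}s  step {step * 1e3:+.3f} ms")
```

The sign of each step is kept so the direction of a slip can be checked.  Running the same scan over the files from several clients would show whether their jumps line up in time.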

Maybe I'll turn on 'peerstats' too, but I really doubt a 
stand-alone, good-quality switch would be causing random delays.  
Pings are consistently 400 microseconds and 'ntpq -p' reports 800 
microsecond round-trip delays.  I've never heard of a switch
causing a 5ms delay.

>> 
>> However once or twice a day, all the systems experience a 
>> random, uncorrelated time shift of from one to several 
>> milliseconds.  Had an issue where a UPS voltage correction shift 
>
>In which direction is the slip?  Backward only slips against true time 
>(these might appear as forward slips if the real error is in the server) 
>are typically due to lost clock interrupts.  If that is the case it 
>implies you are using a tick rate of other than 100Hz.  Please note that 
>the Linux kernel code is broken for clock frequencies other than 100Hz 
>and the use of 1000Hz significantly increases the likelihood of a lost 
>interrupt.

Perhaps that's a problem.  The RHEL/Centos stock kernel seems to
have a 1000Hz clock interrupt.  At least 'vmstat' shows 1000
ints/sec on an idle system.
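
To put numbers on the lost-interrupt idea, here is a small Python sketch of the arithmetic.  It assumes the simplest possible model: every lost clock interrupt costs exactly one tick period and the kernel never recovers it from another time source.  The function names are hypothetical.

```python
# Sketch: how far the clock slips per lost timer interrupt, assuming
# one lost interrupt == one lost tick period (a simplification).

def slip_per_lost_tick_ms(hz):
    """Milliseconds of clock time lost per missed tick at a given tick rate."""
    return 1000.0 / hz


def lost_ticks_for_slip(slip_ms, hz):
    """How many missed ticks would account for a slip of slip_ms."""
    return slip_ms / slip_per_lost_tick_ms(hz)


if __name__ == "__main__":
    for hz in (100, 1000):
        print(f"{hz} Hz: {slip_per_lost_tick_ms(hz):.1f} ms per lost tick, "
              f"5 ms slip = {lost_ticks_for_slip(5.0, hz):.1f} ticks")
```

Under that assumption a 1000Hz kernel loses 1 ms per missed tick, so backward slips of one to several milliseconds would correspond to a whole number of lost ticks.  At 100Hz a single lost tick would cost 10 ms.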

>The normal source of lost interrupts is disk drivers using programmed 
>transfers.

Think it's all DMA.  Remember this is a really diverse bunch
of machines and OSs.  The RS/6000 is working the best.

These jumps aren't killing me.  Just want to figure out if they 
can be eliminated.  If we needed super accurate time we'd 
probably have to make use of PTP (Precision Time Protocol).
Still très expensive.



