[ntp:questions] Bizarre half second disagreement between ntp hosts

Unruh unruh-spam at physics.ubc.ca
Fri Jan 11 19:19:17 UTC 2008

"Luis Colorado" <luis.colorado at hispalinux.es> writes:

>"Unruh" <unruh-spam at physics.ubc.ca> wrote in message
>news:aVChj.26825$fj2.1600 at edtnps82...
>> "Luis Colorado" <luis.colorado at hispalinux.es> writes:
>>>Just suppose you are downloading some large file that creates asymmetries
>>>in the data flow bandwidth.  When that happens, you observe discrepancies
>>>with external time sources in the form of systematic offsets to the
>>>clocks outside.  When you lose a local time source, your server gets the
>>>time reference from the couple of best stratum clocks available and can
>>>skip even more than a second.
>>>If you need reliability and precision, you can set up two or more
>>>servers with GPS time sources and connect them over Ethernet media.
>>>You'll have better than 1ms on the whole local net if you use PPS
>> That is the machine that has the pps (from a GPS 18LVM) signal. That
>> dropped for some reason, and then there seemed to be a discrepancy of
>> half a sec between tick.usask.ca and the other three level 2 or 3 ntp
>> sources. I realise that ntp would assume that the majority rules, but the
>> majority was wrong (as seen on all of the other systems that got their time
>> from that system by chrony. They all suddenly saw a half sec jump in
>> time, i.e. they suddenly found themselves with a .48 sec offset).
>> So I did everything I thought I could to get reliability and precision, and
>> instead got a half second error.

>No, suppose you have an asymmetric roundtrip to tick.usask.ca, due to a
>long download being done at your site. If the download is at your
>site, you'll have the same asymmetric roundtrip to all the remote
>servers configured in your ntpd (and the same conditions apply for the
>three servers you posted).

Except there was no large download, and this situation seems to have
continued for about 30 min.

>You'll have long delays in the frames that come to you, but short ones in the
>frames you send to your time servers (the same for all three).  NTP
>cannot know how much time is spent on the incoming path and how much
>on the outgoing path, so it assumes the typical case of a 50% split in
>each direction when calculating the offsets.

Of course. But all four sites should have been the same. 
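
A minimal sketch of the symmetric-path assumption discussed above, using the standard NTP on-wire formulas for offset and delay. The `simulate` helper and the one-way delays fed to it are hypothetical values chosen for illustration, not measurements from this incident. It shows that path asymmetry shifts the computed offset by half the difference of the one-way delays, so the resulting error cannot exceed half the measured round-trip delay.

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Standard NTP on-wire calculation.

    t1: client transmit time (client clock)
    t2: server receive time  (server clock)
    t3: server transmit time (server clock)
    t4: client receive time  (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate(true_offset, fwd, back, proc=0.0):
    """Hypothetical exchange: true_offset is server clock minus client clock,
    fwd/back are the one-way delays out and in, proc is server processing time."""
    t1 = 0.0
    t2 = t1 + fwd + true_offset
    t3 = t2 + proc
    t4 = (t3 - true_offset) + back
    return ntp_offset_delay(t1, t2, t3, t4)


def max_asymmetry_error(delay):
    """Worst-case offset error from path asymmetry alone: half the round trip."""
    return delay / 2.0


if __name__ == "__main__":
    # Assumed example: 5 ms outbound, 40 ms inbound (a saturated download link).
    offset, delay = simulate(true_offset=0.0, fwd=0.005, back=0.040)
    print("measured offset %.4f s, delay %.4f s" % (offset, delay))
    print("bound on asymmetry error %.4f s" % max_asymmetry_error(delay))
```

Applied to the delays in the ntpq report quoted in this thread (44.954, 1.486, 17.295 and 70.415 ms), the bound comes out at roughly 22.5, 0.74, 8.6 and 35.2 ms, so asymmetry on the path to each server cannot by itself account for an offset of about 480 ms.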

>The result is (read the NTP Request for Comments document for an
>explanation) that you measure false offsets from these servers (and
>actually the ***same*** false offsets).

>You must consider the absolute error in this case (you are indeed in a
>worst case, as all errors add without compensation) which is the root
>distance plus the root dispersion, and you will see that it is in the
>order of your measured offset.

>> Note that tick.usask.ca typically has a .2 ms offset, with a 40ms

>you say typically, but what happens in the case you posted?

Unfortunately not enough information was preserved to let me know exactly
what happened. 

>The main reason for using a local time source is to get better than
>half a second over the internet. 0.5 s is a typical situation on a loaded
>line over the internet.

>> roundtrip.
>> The others were pool sources.

>Another source of systematic errors is the erroneous supposition that
>there is no delay between the PPS interrupt and the timestamp taken in
>the kernel to get the clock offset.  But this effect is on the order of
>microseconds, not milliseconds.  Consider that if you are interfacing a
>TTL PPS signal through a TTL to RS232 level converter, you are losing
>several microseconds in the converter gates.

The pps goes into a parallel port with a direct parallel port interrupt
dedicated to the pps, whose only purpose is to timestamp it. I altered the
shm driver to read these timestamps ( put out on a /dev/gpsint interface)
and do the shm averaging. So these interrupt delays should be minimal. And
that is what I see. However, in this case, pps was lost for some reason for
about half an hour. 

>You have another systematic delay in the interval between the PPS
>interrupt and the time the kernel timestamps the event.  This offset can
>be variable if the interrupt is not of high priority or the CPU caches
>instructions in a high speed memory (suppose a worst case of a kernel
>that pages interrupt code to disk, which produces a page fault when the
>PPS arrives, so that the kernel cannot timestamp the event until the page
>is loaded from disk to memory).  This is not the case on current
>operating systems, but consider that normally PPS signals are fed to the
>kernel over RS232 lines (slow lines with low priority interrupts).

>What version of ntpd are you using?

>What kind of PPS are you using?
GPS 18LVM to the parallel port, handled by a dedicated interrupt service

>How are you interfacing PPS signal to the kernel?

The incoming pps is timestamped and the timestamps are written to a /dev
interface, which is read by the shm ntp driver, which then filters them and
sends them to ntp.

>Are you actually interfacing the PPS signal to the kernel?

No. They go via the shm clock driver.

>Is your kernel adequately configured to use the PPS signal?
>What Operating System are you using?


>What version?

Mandriva 2007.1 (kernel 2.6.17-16mdv)

The typical scatter in the clock offset read by ntp is .3usec (see
www.theory.physics.ubc.ca/chrony/chrony.html; the last graph is of the ntp
offsets of the clock as produced out of the shm driver. The anomaly at Jan
9.1 was while I was playing with the ntp and the clock.)

>>>"Unruh" <unruh-spam at physics.ubc.ca> wrote in message
>>>news:FMfej.53259$5l3.36002 at edtnps82...
>>>>I have a very weird situation. I am running a GPS PPS (Garmin [...]
>>>> with a few machines as a backup/initialization.
>>>> Suddenly, for about half an hour, my GPS failed for some reason (I still do
>>>> not know what was wrong, since it had come back on air by the time I [...]
>>>> something wrong). Every hour I run a ntpq -p just to check that my gps is
>>>> on air. I got this report.
>>>>      remote      refid           st t when poll reach    delay    offset
>>>> ========================================================================
>>>> xtick.usask.ca   .GPS.            1 u 1003 1024   377   44.954     0.213
>>>> +sanrail.com                      2 u  993 1024   377    1.486   -479.03
>>>> +raptor.tera-byt                  2 u  322 1024   377   17.295   -480.35
>>>> *zeus.yocum.org                   2 u  390 1024   377   70.415   -481.02
>>>>  SHM(0)          .PPS.            0 l 1415   16     0    0.000    -0.002
>>>> Now I believe the tick.usask.ca result, since all of the machines which use
>>>> mine as a source suddenly noticed a .48 second jump when my GPS failed. But
>>>> why in the world would three systems all suddenly be out by .48 [...]
>>>> Doing a peers on them, one has a GPS as its source, one a .WWVB. and one an
>>>> .ACTS. Why should all three suddenly be out by half a second?
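
The "majority rules" outcome in that report can be sketched roughly as follows. This is a deliberate simplification and an assumption on my part: ntpd's real selection intersects correctness intervals built from each source's offset and root distance, whereas the hypothetical `largest_agreeing_group` function below only groups raw offsets that lie within an assumed 10 ms of each other. The offsets are the ones from the ntpq output; SHM(0) is left out because its reach was 0.

```python
# Offsets in milliseconds, copied from the ntpq -p report quoted in this thread.
OFFSETS_MS = {
    "tick.usask.ca": 0.213,
    "sanrail.com": -479.03,
    "raptor.tera-byt": -480.35,
    "zeus.yocum.org": -481.02,
}


def largest_agreeing_group(offsets_ms, tolerance_ms):
    """Return the largest set of sources whose offsets span at most tolerance_ms.

    Simplified stand-in for NTP's intersection step, not the real algorithm.
    """
    items = sorted(offsets_ms.items(), key=lambda kv: kv[1])
    best = []
    for i, (_, low) in enumerate(items):
        # Collect every source whose offset is within tolerance of this one.
        group = [name for name, off in items[i:] if off - low <= tolerance_ms]
        if len(group) > len(best):
            best = group
    return best


if __name__ == "__main__":
    winners = largest_agreeing_group(OFFSETS_MS, tolerance_ms=10.0)
    outliers = sorted(set(OFFSETS_MS) - set(winners))
    print("agreeing sources:", winners)
    print("rejected sources:", outliers)
```

With these numbers the three pool servers form the largest group and tick.usask.ca is the one rejected, which matches the "x" tally code next to it in the report even though it was apparently the only correct source.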
