[ntp:questions] Bizzare half second disagreement between ntp hosts
unruh-spam at physics.ubc.ca
Fri Jan 11 19:19:17 UTC 2008
"Luis Colorado" <luis.colorado at hispalinux.es> writes:
>"Unruh" <unruh-spam at physics.ubc.ca> wrote in message
>news:aVChj.26825$fj2.1600 at edtnps82...
>> "Luis Colorado" <luis.colorado at hispalinux.es> writes:
>>>Just suppose you are downloading some great file that creates
>>>in the data flow bandwidth. When that happens, you observe
>>>with external time sources in the form of systematic offsets to the
>>>clocks outside. When you lose a local time source, your server gets
>>>time reference from the couple of best stratum clocks available and
>>>skips even more than a second.
>>>If you need reliability and precision, you can mount two or more
>>>servers with gps timesources and connect them over ethernet media.
>>>You'll have better than 1ms on the whole local net if you use PPS
>> That is the machine that has the pps (from a GPS 18LVM) signal. That
>> dropped for some reason, and then there seemed to be a discrepancy of
>> half a sec between tick.usask.ca, and the other three level 2 or 3 ntp
>> sources. I realise that ntp would assume that the majority rules, but
>> majority was wrong (as seen on all of the other systems who got their
>> from that system by chrony. They all suddenly saw a half sec jump in
>> time-- ie they suddenly found themselves with a .48 sec offset).
>> So I did everything I thought I could to get reliability and
>> instead got a half second error.
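
For readers following the "majority rules" point: ntpd's source selection is a refinement of Marzullo's interval-intersection algorithm. What follows is only a minimal sketch of the plain Marzullo sweep, not ntpd's actual code. The offsets and error bounds are hypothetical values chosen to resemble the situation described (one source near zero, three agreeing near 0.48 s); they are not taken from the report.

```python
def marzullo(intervals):
    """Return (count, lo, hi): the sub-interval covered by the most sources."""
    edges = []
    for lo, hi in intervals:
        edges.append((lo, -1))  # -1 marks the start of an interval
        edges.append((hi, +1))  # +1 marks the end of an interval
    edges.sort()                # on ties, starts sort before ends

    best = count = 0
    best_lo = best_hi = None
    for i, (x, kind) in enumerate(edges):
        count -= kind           # a start raises coverage, an end lowers it
        if count > best:
            best = count
            best_lo = x
            best_hi = edges[i + 1][0]
    return best, best_lo, best_hi


# Hypothetical (offset - bound, offset + bound) intervals, in seconds.
sources = [
    (-0.025, 0.025),  # one source claiming roughly zero offset
    (0.470, 0.490),   # three sources agreeing on roughly 0.48 s
    (0.465, 0.495),
    (0.440, 0.520),
]

count, lo, hi = marzullo(sources)
print(count, lo, hi)
```

With intervals like these the lone source falls outside the best intersection, so it is the one rejected, whether or not it is the correct one.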
>not, suppose you have an asymmetric roundtrip to tick.usask.ca, due to a
>long download being done at your site. If the download is at your
>site, you'll have the same asymmetric roundtrip to all the remote
>servers configured in your ntpd (and the same conditions apply for the
>three servers you post).
Except there was no large download, and this situation seems to have
continued for about 30 min.
>you'll have long delays in the frames that come to you, but short in the
>frames you send to your time servers (the same for the three ones). NTP
>cannot know how much time is spent on the incoming path and how much
>on the outgoing path, so it assumes the typical case of 50% in
>either direction when calculating the offsets.
Of course. But all four sites should have been the same.
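
To make the 50/50 assumption concrete, here is a minimal sketch of the standard NTP on-wire calculation from the four timestamps (t1 client send, t2 server receive, t3 server send, t4 client receive). The `simulate` helper and the delay figures are hypothetical, chosen only to show that an asymmetric path shifts the measured offset by half the difference between the two one-way delays.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Standard NTP estimates; the offset is exact only if both paths are equal."""
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate(true_offset, out_delay, back_delay, server_hold=0.0):
    """Hypothetical exchange: true_offset is server clock minus client clock."""
    t1 = 100.0                          # client clock at send
    t2 = t1 + true_offset + out_delay   # server clock at receive
    t3 = t2 + server_hold               # server clock at reply
    t4 = t3 - true_offset + back_delay  # client clock at receive
    return ntp_offset_and_delay(t1, t2, t3, t4)


# Clocks actually agree; outbound path 5 ms, inbound path 40 ms (assumed).
offset, delay = simulate(true_offset=0.0, out_delay=0.005, back_delay=0.040)
print(offset, delay)
```

Here the measured offset is (0.005 - 0.040) / 2 = -0.0175 s although the clocks agree, and the measured round-trip delay is 0.045 s. The error can never exceed half the measured delay as long as neither one-way delay is negative.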
>The result is (read the NTP Request for Comments document for an
>explanation) that you measure false offsets from these servers (and
>actually the ***same*** false offsets).
>You must consider the absolute error in this case (you are indeed in a
>worst case, as all errors add without compensation) which is the root
>distance plus the root dispersion, and you will see that it is on the
>order of your measured offset.
>> Note that tick.usask.ca typically has a .2 ms offset, with a 40ms
>you say typically, but what happens in the case you posted?
Unfortunately not enough information was preserved to let me know exactly
>The main reason for using a local source of time is getting better than
>half a second on the internet. 0.5 s is a typical situation on a loaded line
>> The others were pool sources.
>Another source of systematic errors is the erroneous supposition that
>there is no delay between the moment you get the PPS interrupt and the
>timestamp obtained in the kernel to get the clock offset. But this
>affects in the order of microseconds, not milliseconds. Consider that
>if you are interfacing a TTL PPS signal over a TTL to RS232 level
>converter, you are losing several microseconds in the converter gates.
The pps goes into a parallel port with a direct parallel port interrupt
dedicated to the pps, whose only purpose is to timestamp it. I altered the
shm driver to read these timestamps ( put out on a /dev/gpsint interface)
and do the shm averaging. So these interrupt delays should be minimal. And
that is what I see. However, in this case, pps was lost for some reason for
about half an hour.
>You have another systematic delay in the interval that goes between the
>PPS interrupt and the time the kernel makes a timestamp of the event.
>This offset can be variable if the interrupt is not of high priority or
>the CPU caches instructions in a high speed memory (suppose a worst case
>of a kernel that pages interrupt code to disk, which produces a page
>fault when the PPS arrives, not allowing the kernel to timestamp that
>event until the page is loaded from disk to memory). This is not
>the case on current operating systems, but consider that normally PPS
>signals are fed to the kernel over RS232 lines (slow lines at low
>What version of ntpd are you using?
>What kind of PPS are you using?
GPS 18LVM to the parallel port, handled by a dedicated interrupt service
>How are you interfacing PPS signal to the kernel?
The incoming pps is timestamped and the timestamps written to a /dev interface, which
is read by the shm ntp driver, which then filters them and sends them to
>Are you actually interfacing the PPS signal to the kernel?
No. They go via the shm clock driver.
>Is your kernel adequately configured to use the PPS signal?
>What Operating System are you using?
Mandriva 2007.1 (kernel 2.6.17-16mdv)
The typical scatter in the clock offset read by ntp is .3usec (see
www.theory.physics.ubc.ca/chrony/chrony.html -- the last graph is of the ntp
offsets of the clock as produced out of the shm driver -- the anomaly at Jan
9.1 was while I was playing with the ntp and the clock.)
>>>"Unruh" <unruh-spam at physics.ubc.ca> wrote in message
>>>news:FMfej.53259$5l3.36002 at edtnps82...
>>>>I have a very weird situation. I am running a GPS PPS (Garmin
>>>> with a few machines as a backup/initialization.
>>>> Suddenly, for about half an hour, my GPS failed for some reason (
>>>> not know what was wrong since it had come back on air by the time I
>>>> something wrong). Every hour I run a ntpq -p just to check that my
>>>> on air. I got this report.
>>>>      remote           refid      st t when poll reach   delay
>>>> xtick.usask.ca   .GPS.            1 u 1003 1024  377   44.954
>>>> +sanrail.com     188.8.131.52     2 u  993 1024  377    1.486
>>>> +raptor.tera-byt 184.108.40.206   2 u  322 1024  377   17.295
>>>> *zeus.yocum.org  220.127.116.11   2 u  390 1024  377   70.415
>>>>  SHM(0)          .PPS.            0 l 1415   16    0    0.000
>>>> Now I believe the tick.usask.ca result, since all of the machines
>>>> mine as a source suddenly noticed a .48 second jump when my GPS
>>>> why in the world would three systems all suddenly be out by .48
>>>> Doing a peers on them, one has a GPS as its source, one a .WWVB. and
>>>> .ACTS. Why should all three suddenly be out by half a second?
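
As a rough cross-check on the report above, the sketch below reads the columns that survive in this archive copy (the offset and jitter columns are not present) and does two things: it flags any source whose reach register is 0, and it computes half the round-trip delay for the rest. The reach column is octal, so 377 means the last eight polls were all answered. Half the delay is the most that path asymmetry alone could shift a measured offset, assuming non-negative one-way delays. The parsing is an illustrative assumption about whitespace-separated fields, not ntpq's own format handling, and the tally character is left attached to the remote name.

```python
REPORT = """\
xtick.usask.ca .GPS. 1 u 1003 1024 377 44.954
+sanrail.com 188.8.131.52 2 u 993 1024 377 1.486
+raptor.tera-byt 184.108.40.206 2 u 322 1024 377 17.295
*zeus.yocum.org 220.127.116.11 2 u 390 1024 377 70.415
SHM(0) .PPS. 0 l 1415 16 0 0.000
"""


def parse_rows(text):
    """Parse remote, reach (octal) and delay (ms) from whitespace-split rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip headers or damaged lines
        rows.append({
            "remote": fields[0],
            "reach": int(fields[6], 8),
            "delay_ms": float(fields[7]),
        })
    return rows


rows = parse_rows(REPORT)

# Sources that answered none of the last eight polls.
unreachable = [r["remote"] for r in rows if r["reach"] == 0]

# Largest offset error that asymmetry alone could cause, per source.
half_delay_ms = {r["remote"]: r["delay_ms"] / 2.0 for r in rows if r["reach"] != 0}
worst_ms = max(half_delay_ms.values())

print("unreachable:", unreachable)
print("worst-case asymmetry error (ms):", worst_ms)
```

On these figures the largest asymmetry bound is about 35 ms, well short of the 480 ms discrepancy, provided the delay column reflects the path at the time of the anomaly.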