[ntp:questions] NTP losing connection to SHM driver

Jörg Neulist ka-cj6 at gmx.de
Wed Feb 8 17:45:04 UTC 2012


Hello dear NTP experts,

If you would like to tackle some strange (and, as far as I can tell,
undocumented) behaviour, here it is:
I am using ntpd 4.2.6p3 on an embedded system. On this box, the RTC is
far more stable than the system clock, so I have written an
interface to the SHM driver to make the RTC available to ntpd. But,
strangely, the association seems to be reset every once in a while:
the reach drops to 0 immediately, as it would if ntpd were restarted
(which I know for sure it is not).
It does not happen at regular intervals, but it does happen rather
often. I have tried several combinations:
a) Writing to the SHM driver every 15 minutes (cron), with
minpoll=maxpoll=10
b) Writing to the SHM driver every 5 minutes (cron), with
minpoll=maxpoll=10
c) Writing to the SHM driver every 30 s (daemon), with the default
polling intervals.
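
For readers unfamiliar with the SHM driver, here is a minimal sketch (not the interface used in this report) of how a writer publishes one sample. The field layout follows struct shmTime in ntpd's refclock_shm.c as I understand it for the 4.2.6 series. Modelling time_t as c_long, the helper name publish and the default precision of 0 are assumptions. Attaching to the real segment (for unit 1 the key should be 0x4e545031) is left out, so the code only fills an in-memory structure.

```python
import ctypes


class ShmTime(ctypes.Structure):
    """Assumed layout of the segment read by ntpd's SHM driver (type 28).

    Assumption: time_t is modelled as c_long; check this against the
    target platform's ABI before relying on it.
    """
    _fields_ = [
        ("mode", ctypes.c_int),
        ("count", ctypes.c_int),
        ("clockTimeStampSec", ctypes.c_long),
        ("clockTimeStampUSec", ctypes.c_int),
        ("receiveTimeStampSec", ctypes.c_long),
        ("receiveTimeStampUSec", ctypes.c_int),
        ("leap", ctypes.c_int),
        ("precision", ctypes.c_int),
        ("nsamples", ctypes.c_int),
        ("valid", ctypes.c_int),
        ("dummy", ctypes.c_int * 10),
    ]


def publish(seg, clock, receive, precision=0):
    """Store one sample using the mode 1 handshake (hypothetical helper).

    clock   -- (sec, usec) read from the reference clock, here the RTC
    receive -- (sec, usec) of the system clock at the moment of that read
    """
    seg.mode = 1
    seg.valid = 0        # tell the reader the sample is being rewritten
    seg.count += 1       # first bump: update in progress
    seg.clockTimeStampSec, seg.clockTimeStampUSec = clock
    seg.receiveTimeStampSec, seg.receiveTimeStampUSec = receive
    seg.leap = 0
    seg.precision = precision
    seg.count += 1       # second bump: update complete
    seg.valid = 1        # the sample may now be consumed


seg = ShmTime()
publish(seg, clock=(1326404592, 0), receive=(1326404591, 150000))
print(seg.valid, seg.count)
```

As far as I know the driver clears valid after consuming a sample, so a poll that still finds valid at 0 counts as a poll without new data.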

In test c) it seemed to lose the connection at least twice per hour,
but could reestablish it rather quickly. In scenarios a) and b) it
happened less often, but the outages lasted much longer (I have no
exact data here).

One thing I noted in test b): it looked as if the SHM driver was
unreachable; the "when" counter counted well past "poll", and then
"reach" dropped to zero immediately. The strange thing is that this
appeared to happen when the last successful poll (according to "when")
encountered very fresh data (i.e. it usually came a few seconds after
the cron job that wrote to the shm driver). Why would this make the
NEXT attempt fail?

Further testing showed me that the driver does shift a zero bit into
"reach" if, at the end of a poll interval, it does not encounter new
data, i.e. if the shm segment has not been written since its last
poll. That clears only one bit per poll, though, so this is not what
is happening here.
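
To illustrate the difference, here is a small sketch of the reachability register: an 8-bit shift register that ntpq prints in octal. The function names are made up for this example. It shows what missed polls alone would produce, starting from a full register.

```python
def shift_reach(reach: int, got_sample: bool) -> int:
    """Shift the result of one poll into the 8-bit reach register."""
    return ((reach << 1) | int(got_sample)) & 0xFF


def simulate(polls, reach=0):
    """Return the octal reach value, as ntpq prints it, after each poll."""
    history = []
    for ok in polls:
        reach = shift_reach(reach, ok)
        history.append(format(reach, "o"))
    return history


# Eight successful polls after a (re)start:
print(simulate([True] * 8))
# ['1', '3', '7', '17', '37', '77', '177', '377']

# Eight missed polls starting from a full register:
print(simulate([False] * 8, reach=0o377))
# ['376', '374', '370', '360', '340', '300', '200', '0']
```

With missed polls alone, reach would need eight poll intervals to decay from 377 to 0, passing through the intermediate values on the way.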

There is something very wrong here, either with my understanding, or
with the driver itself. Could somebody enlighten me?

Thankfully yours,
Jörg


An excerpt (grepped from ntpq output saved every five minutes; the
columns are remote, refid, st, t, when, poll, reach, delay, offset,
jitter):

2012-01-12-223501.report:*127.127.28.1    .RTC.           12 l   60   64    3    0.000  -19.200   9.692
2012-01-12-224001.report:*127.127.28.1    .RTC.           12 l   41   64  177    0.000  926.048 574.127
2012-01-12-224501.report:*127.127.28.1    .RTC.           12 l   20   64  377    0.000  849.446  69.440
2012-01-12-225001.report: 127.127.28.1    .RTC.           12 l   18   64    0    0.000    0.000   0.002

You can see that there is no intermediate reach value such as 340,
which one would expect after a few missed polls; it looks more like a
complete reset. The full ntpq variables output at 22:45 and 22:50
looks like this:

Internal:
associd=0 status=0413 leap_none, sync_uhf_radio, 1 event, spike_detect,
version="ntpd 4.2.6p3@1.2290 Tue Jan 10 13:39:46 UTC 2012 (1)",
processor="armv5tel", system="Linux/2.6.37.6+", leap=00, stratum=13,
precision=-19, rootdelay=0.000, rootdisp=923.530, refid=SHM(1),
reftime=d2b9d270.aa85bff7  Thu, Jan 12 2012 22:43:12.666,
clock=d2b9d284.489d3baf  Thu, Jan 12 2012 22:43:32.283, peer=30840, tc=6,
mintc=3, offset=-19.200, frequency=-500.000, sys_jitter=69.440,
clk_jitter=4.651, clk_wander=67.436


Timeserver #1 (30840):
associd=30840 status=9614 conf, reach, sel_sys.peer, 1 event, reachable,
srcadr=127.127.28.1, srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=12, precision=0, rootdelay=0.000, rootdisp=0.000, refid=RTC,
reftime=d2b9d256.42d70e55  Thu, Jan 12 2012 22:42:46.261,
rec=d2b9d270.aa85bff7  Thu, Jan 12 2012 22:43:12.666, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=6, ppoll=6, headway=0, flash=00 ok,
keyid=0, offset=849.446, delay=0.000, dispersion=4.359, jitter=69.440,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=  849.45  858.93  886.08  905.60  915.25  926.05  941.68  951.73,
filtdisp=      3.38    4.37    5.53    6.34    7.10    9.97    9.16   10.68
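
The hex timestamps in these dumps are 64-bit NTP timestamps: 32 bits of seconds since 1900-01-01 UTC and 32 bits of binary fraction. The following sketch converts them to UTC; the function name and the era parameter are made up for this example.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def ntp_to_datetime(stamp: str, era: int = 0) -> datetime:
    """Convert 'ssssssss.ffffffff' (hex) to an aware UTC datetime.

    era selects the 2**32-second window; era 0 starts in 1900,
    era 1 starts in 2036.
    """
    sec_hex, frac_hex = stamp.split(".")
    seconds = era * 2**32 + int(sec_hex, 16)
    # Integer arithmetic avoids float rounding in the fraction.
    micros = int(frac_hex, 16) * 1_000_000 // 2**32
    return NTP_EPOCH + timedelta(seconds=seconds, microseconds=micros)


print(ntp_to_datetime("d2b9d270.aa85bff7"))
# 2012-01-12 21:43:12.666103+00:00

print(ntp_to_datetime("00000000.00000000", era=1))
# 2036-02-07 06:28:16+00:00
```

The first result is one hour behind the 22:43:12.666 printed by ntpq, which suggests that ntpq was showing local time at UTC+1. The second shows why an all-zero timestamp is printed as a date in February 2036: it is apparently read as the start of the next NTP era rather than as 1900.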


After the "disconnect":

Internal:
associd=0 status=c414 leap_alarm, sync_uhf_radio, 1 event, freq_mode,
version="ntpd 4.2.6p3@1.2290 Tue Jan 10 13:39:46 UTC 2012 (1)",
processor="armv5tel", system="Linux/2.6.37.6+", leap=11, stratum=16,
precision=-19, rootdelay=0.000, rootdisp=0.000, refid=STEP,
reftime=00000000.00000000  Thu, Feb  7 2036  7:28:16.000,
clock=d2b9d3b2.6c65d5b5  Thu, Jan 12 2012 22:48:34.423, peer=30840, tc=6,
mintc=3, offset=0.000, frequency=-500.000, sys_jitter=0.002,
clk_jitter=0.002, clk_wander=67.436


Timeserver #1 (30840):
associd=30840 status=8014 conf, sel_reject, 1 event, reachable,
srcadr=127.127.28.1, srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=12, precision=0, rootdelay=0.000, rootdisp=0.000, refid=RTC,
reftime=d2b9d3a0.51704fd5  Thu, Jan 12 2012 22:48:16.318,
rec=00000000.00000000  Thu, Feb  7 2036  7:28:16.000, reach=000,
unreach=0, hmode=3, pmode=4, hpoll=6, ppoll=6, headway=0,
flash=1000 peer_unreach, keyid=0, offset=0.000, delay=0.000,
dispersion=16000.000, jitter=0.002,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=   16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
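
To compare the status words before and after the event, here is a sketch that splits them into their bit fields, following the layout given in the ntpq event/status documentation as I understand it. The function names and dictionary keys are made up for this example; the numeric codes are left undecoded.

```python
def decode_system_status(word: int) -> dict:
    """Split a 16-bit system status word into its four fields."""
    return {
        "leap": (word >> 14) & 0x3,       # 2-bit leap indicator
        "source": (word >> 8) & 0x3F,     # 6-bit sync source code
        "event_count": (word >> 4) & 0xF,
        "event_code": word & 0xF,
    }


def decode_peer_status(word: int) -> dict:
    """Split a 16-bit peer status word into flags and fields."""
    return {
        "configured": bool(word & 0x8000),
        "reachable": bool(word & 0x1000),
        "select": (word >> 8) & 0x7,      # 3-bit selection status
        "event_count": (word >> 4) & 0xF,
        "event_code": word & 0xF,
    }


for word in (0x0413, 0xC414):
    print(format(word, "04x"), decode_system_status(word))
for word in (0x9614, 0x8014):
    print(format(word, "04x"), decode_peer_status(word))
```

For the values above, the system word goes from leap 0 with event code 3 (spike_detect) to leap 3 with event code 4 (freq_mode), and the peer word loses its reach flag while the selection field drops from 6 (sys.peer) to 0 (reject).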


