[ntp:questions] ntpd losing sync

A C agcarver+ntp at acarver.net
Sat Feb 4 08:33:07 UTC 2012


Ok, I thought this was a one-off problem, but ntpd has lost sync again,
about four days after a restart.  It never regains sync.

It starts with what seems to be the system clock drifting away from the 
PPS lock and then the oscillations from corrections are just too great 
and the whole thing blows up.


Here's the current configuration for version 4.2.7p236:

server          0.us.pool.ntp.org minpoll 9 iburst
server          1.us.pool.ntp.org minpoll 9 iburst
server          0.north-america.pool.ntp.org minpoll 9 iburst
server ntp1.gatech.edu prefer minpoll 9
server rolex.usg.edu minpoll 9
server  127.127.22.0  minpoll 2 maxpoll 4
fudge   127.127.22.0  time1 +0.000 flag2 1 flag3 1 refid PPS
server  127.127.28.0  minpoll 7 noselect
fudge   127.127.28.0  time1 -0.6 refid GPSD


The peer list after waiting about a day from the initial system upset:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
x127.127.22.0    .PPS.            0 l    -   16  377    0.000  -465.49 355.933
 127.127.28.0    .GPSD.           0 l    -  128  377    0.000  -208986 2833.87
 207.7.148.214   216.218.254.202  2 u    -  512  377  1045.07  -209713 11784.0
 72.14.179.211   127.67.113.92    2 u    -  512  377  1029.80  -201710 6559.37
 173.255.224.22  128.4.1.1        2 u  245  512  377  919.628  -202629 7684.05
 130.207.165.28  130.207.244.240  2 u    -  512  377  994.543  -204125 7778.28
 131.144.4.10    65.212.71.102    2 u   23  512  377  1000.21  -203648 7687.63

Note that the PPS offset is swinging wildly, which this static snapshot
does not show.

ntpq associations:
ind assid status  conf reach auth condition  last_event cnt
===========================================================
   1  4560  912a   yes   yes  none falsetick    sys_peer  2
   2  4561  9014   yes   yes  none    reject   reachable  1
   3  4562  9014   yes   yes  none    reject   reachable  1
   4  4563  9034   yes   yes  none    reject   reachable  3
   5  4564  9014   yes   yes  none    reject   reachable  1
   6  4565  904a   yes   yes  none    reject    sys_peer  4
   7  4566  9014   yes   yes  none    reject   reachable  1
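
For reference, the status words in the table above can be decoded by hand.
This is a minimal sketch, assuming the peer status word layout given in the
ntpq documentation (status flags in the top five bits, a 3-bit selection
code, a 4-bit event count and a 4-bit event code).  The function name and the
lookup tables are made up for this example, and the event table covers only
the two codes that appear in this listing.

```python
# Selection codes (bits 10-8 of the peer status word), in ntpq's order.
SELECT = ["reject", "falsetick", "excess", "outlier",
          "candidate", "backup", "sys.peer", "pps.peer"]

# Only the event codes seen in the listing above; anything else prints as hex.
EVENT = {0x4: "reachable", 0xA: "sys_peer"}


def decode_peer_status(word):
    """Split a 16-bit ntpq peer status word into its fields."""
    high = word >> 8                      # status flags + selection code
    return {
        "conf": bool(high & 0x80),        # persistent (configured) association
        "reach": bool(high & 0x10),       # host reachable
        "condition": SELECT[high & 0x07],
        "cnt": (word >> 4) & 0x0F,        # event counter
        "last_event": EVENT.get(word & 0x0F, hex(word & 0x0F)),
    }


if __name__ == "__main__":
    for status in (0x912A, 0x9014, 0x9034, 0x904A):
        print(f"{status:04x}", decode_peer_status(status))
```

Decoded this way, 912a and 904a carry sys_peer only as their last event.
Their current conditions are falsetick and reject, so neither association
is the selected system peer in this snapshot.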

rv 4560 (first sys_peer):
  associd=4560 status=912a conf, reach, sel_falsetick, 2 events, sys_peer,
  srcadr=PPS(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
  stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=PPS,
  reftime=d2d76400.c9b870fd  Sat, Feb  4 2012  8:00:00.787,
  rec=d2d76401.ffffffff  Sat, Feb  4 2012  8:00:02.000, reach=377,
  unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=4, headway=0, flash=00 ok,
  keyid=0, offset=259.524, delay=0.000, dispersion=4.956, jitter=444.467,
  filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
  filtoffset=  259.52  344.53  419.52  474.51 -430.48 -335.49 -265.48 -185.49,
  filtdisp=      4.74    4.98    5.22    5.47    5.70    5.94    6.18    6.42

rv 4565 (second sys_peer)
  associd=4565 status=904a conf, reach, sel_reject, 4 events, sys_peer,
  srcadr=ntp1.gatech.edu, srcport=123, dstadr=10.0.0.21, dstport=123,
  leap=00, stratum=2, precision=-20, rootdelay=0.565, rootdisp=24.597,
  refid=130.207.244.240,
  reftime=d2d7609d.0646422f  Sat, Feb  4 2012  7:45:33.024,
  rec=d2d76271.00c7dd3a  Sat, Feb  4 2012  7:53:21.003, reach=377,
  unreach=0, hmode=3, pmode=4, hpoll=9, ppoll=9, headway=46,
  flash=400 peer_dist, keyid=0, offset=-204125.520, delay=994.543,
  dispersion=16.941, jitter=7778.280,
  filtdelay=   997.29  999.05  994.54  996.13  994.70  994.38  977.68  995.78,
  filtoffset= -209351 -206700 -204125 -201435 -198758 -196080 -193475 -190882,
  filtdisp=      0.08    8.07   15.83   23.94   32.01   40.08   47.91   55.76
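
The two filtoffset rows above also allow a rough estimate of how fast the
system clock is moving relative to each source.  The sketch below is an
illustration under stated assumptions, not a diagnosis.  It assumes the
filter samples are listed newest first (consistent with filtdisp growing
from left to right), that consecutive samples are one poll interval apart
(16 s for the PPS at hpoll=4, 512 s for ntp1.gatech.edu at hpoll=9), and
that the PPS offset wraps at +/-500 ms because a PPS signal marks only the
second boundary.  The function name drift_ppm is made up for this example.

```python
def drift_ppm(offsets_newest_first_ms, poll_s, wrap_ms=None):
    """Average rate of change of the offset, in parts per million.

    offsets_newest_first_ms: filtoffset values as printed by ntpq (ms).
    poll_s: assumed spacing between samples, in seconds.
    wrap_ms: if given, each step is folded into +/- wrap_ms/2 (assumption
             used for the PPS, whose offset is only known modulo 1 s).
    """
    samples = list(reversed(offsets_newest_first_ms))   # oldest first
    total_ms = 0.0
    for prev, cur in zip(samples, samples[1:]):
        step = cur - prev
        if wrap_ms is not None:
            step = (step + wrap_ms / 2) % wrap_ms - wrap_ms / 2
        total_ms += step
    elapsed_s = poll_s * (len(samples) - 1)
    return total_ms / elapsed_s * 1000.0                # 1 ms/s = 1000 ppm


# filtoffset rows copied from the two rv outputs above.
pps = [259.52, 344.53, 419.52, 474.51, -430.48, -335.49, -265.48, -185.49]
gatech = [-209351, -206700, -204125, -201435,
          -198758, -196080, -193475, -190882]

if __name__ == "__main__":
    print(f"PPS:    {drift_ppm(pps, 16, wrap_ms=1000):.0f} ppm")
    print(f"gatech: {drift_ppm(gatech, 512):.0f} ppm")
```

Under those assumptions both sources give a rate of roughly 5000 ppm in
magnitude, about ten times the 500 ppm maximum frequency correction ntpd
applies.  If that reading is right, the wrap-around of the PPS offset would
account for the apparent wild swings in the peer list.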


I can provide graphs of the offset, dispersion and skew for any of the
peers if anyone wants them.  The physical GPS itself has been ticking
just fine, with no apparent issues in its signal to the machine.  As far
as I can tell from the peers files, the reported PPS offset suddenly
shifts away from its nominal few microseconds.  The offset then swings
wildly (like a PID loop in oscillation) until I restart ntpd, after which
the system clock stabilizes.

The system sits quietly in a corner of the room.  It has no duties other 
than to run ntpd and gpsd.  Whatever monitoring I do is run on other 
systems (ntpd is polled remotely with ntpq on another system, gpsd 
status is queried remotely by another system and compiled there).  The 
oscillations happen after a few days but no obvious cron jobs are 
running at the times that they start.  If there's something I can do to 
instrument ntpd further I can do that and see if I catch the problem.

