[ntp:questions] ntpd sometimes goes wacko, all peers in INIT, packets mostly ignored, but restart fixes it

Alexander Perlis aprl at math.arizona.edu
Wed Oct 3 20:39:24 UTC 2007


 > You haven't really provide any hard information, but a wild guess
 > would be misuse of local clock in an isolated time island.

Thank you for trying to help us. My apologies as a newbie as far as 
investigating ntp trouble is concerned. This present post contains the 
additional information you request. Your assistance and possible insight 
is much appreciated.

To recall (and I hope clarify), we have a bunch of ntp clients with an 
identical list of ntp servers. (My prior post was misleading in that I 
referred to our machines as "servers". They are our webserver, 
mailserver, etc. They are *not* time servers. So as far as ntp is 
concerned, we have a bunch of identically-configured ntp *clients*.)

Sometimes, seemingly at random, we discover some of these boxes to have 
incorrect time-of-day, and investigating ntpd on those boxes shows it to 
be in a seemingly wacko state in which all peers show ".INIT.". Although 
packets are occasionally sent and responses received, they are ignored. 
If we do a "/etc/init.d/ntp restart" then everything is fine.



 > To make a start, we need the contents of the ntp.conf file

Here it is. We started with the default Debian ntp.conf, but since our 
university provides four on-campus ntp servers, we replaced the line
    server pool.ntp.org
with four lines corresponding to our university time servers:

    # /etc/ntp.conf
    driftfile /var/lib/ntp/ntp.drift
    statsdir /var/log/ntpstats/
    statistics loopstats peerstats clockstats
    filegen loopstats file loopstats type day enable
    filegen peerstats file peerstats type day enable
    filegen clockstats file clockstats type day enable
    ### This is the part we changed:
    server ntp.arizona.edu
    server ntp1.arizona.edu
    server ntp2.arizona.edu
    server ntp3.arizona.edu
    ###
    restrict -4 default kod notrap nomodify nopeer noquery
    restrict -6 default kod notrap nomodify nopeer noquery
    restrict 127.0.0.1
    restrict ::1
    # end-of-file


 > and the result of the ntpq peers command.

On a wacko ntpd:
ntpq> peers
      remote           refid      st t when poll reach   delay   offset 
  jitter
==============================================================================
  Hopey.Telcom.Ar .INIT.          16 -    - 1024    0    0.000    0.000 
   0.000
  palm.CCIT.Arizo .INIT.          16 -    - 1024    0    0.000    0.000 
   0.000
  news.ccit.arizo .INIT.          16 -    - 1024    0    0.000    0.000 
   0.000
  grumpy.CCIT.Ari .INIT.          16 -    - 1024    0    0.000    0.000 
   0.000

On a non-wacko ntpd:
ntpq> peers
      remote           refid      st t when poll reach   delay   offset 
  jitter
==============================================================================
-Hopey.Telcom.Ar 204.123.2.72     2 u  277 1024  377    0.501   -3.381 
  0.892
+palm.CCIT.Arizo 192.43.244.18    2 u  845 1024  377    0.692    0.483 
  0.283
+news.ccit.arizo 128.196.133.233  3 u   67 1024  377    1.126    1.039 
  3.293
*grumpy.CCIT.Ari 128.196.120.19   2 u  318 1024  377    6.340    4.147 
  4.295



 > As you seem to be rejecting responses, the output of ntpq's rv
 > command for each server association is also likely to be useful,
 > as it should contain the reject reason.

ntpq> associations

ind assID status  conf reach auth condition  last_event cnt
===========================================================
   1 39197  8000   yes   yes  none    reject
   2 39198  8000   yes   yes  none    reject
   3 39199  8000   yes   yes  none    reject
   4 39200  8000   yes   yes  none    reject
ntpq> rv 39197
assID=39197 status=8000 unreach, conf, no events,
srcadr=Hopey.Telcom.Arizona.EDU, srcport=123, dstadr=0.0.0.0,
dstport=123, leap=11, stratum=16, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=INIT, reach=000, unreach=12420, hmode=3,
pmode=0, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
reftime=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
org=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
rec=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
xmt=caae6ece.f93ad5aa  Wed, Oct  3 2007 12:41:02.973,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=   16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
ntpq> rv 39198
assID=39198 status=8000 unreach, conf, no events,
srcadr=palm.CCIT.Arizona.EDU, srcport=123, dstadr=0.0.0.0, dstport=123,
leap=11, stratum=16, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=INIT, reach=000, unreach=12420, hmode=3,
pmode=0, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
reftime=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
org=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
rec=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
xmt=caae6f45.f935c233  Wed, Oct  3 2007 12:43:01.973,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=   16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
ntpq> rv 39199
assID=39199 status=8000 unreach, conf, no events,
srcadr=news.ccit.arizona.edu, srcport=123, dstadr=0.0.0.0, dstport=123,
leap=11, stratum=16, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=INIT, reach=000, unreach=12420, hmode=3,
pmode=0, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
reftime=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
org=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
rec=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
xmt=caae6e9a.f9a35ab7  Wed, Oct  3 2007 12:40:10.975,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=   16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
ntpq> rv 39200
assID=39200 status=8000 unreach, conf, no events,
srcadr=grumpy.CCIT.Arizona.EDU, srcport=123, dstadr=0.0.0.0,
dstport=123, leap=11, stratum=16, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=INIT, reach=000, unreach=12420, hmode=3,
pmode=0, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
reftime=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
org=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
rec=00000000.00000000  Wed, Feb  6 2036 23:28:16.000,
xmt=caae6f2c.f9cb711a  Wed, Oct  3 2007 12:42:36.975,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=   16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0


What's perhaps strange here is that it says "unreach" and has 
"dstadr=0.0.0.0". The non-wacko ntpds have "reach" and 
"dstadr=<IP_Address_of_Server>". Yet the wacko ntpds *are* communicating 
with the servers (see below).



 > The tcpdump trace might also be helpful.

This tcpdump on a machine with a wacko ntpd seems to indicate (to me
at least) that ntpd is indeed getting packets to the peers and the peers 
are indeed responding. (This corroborates the fact that the ignored
packet count inside ntpd is increasing, as seen from ntpdc iostats.)

webmonkey:~# tcpdump -X -s 0 -vvv dst port 123
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 
65535 bytes
13:14:20.975507 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto: 
UDP (17), length: 76) webmonkey.math.arizona.edu.ntp > 
news.ccit.arizona.edu.ntp: [bad udp cksum 5e7e!] NTPv4, length 48
         Client, Leap indicator: clock unsynchronized (192), Stratum 0, 
poll 10s, precision -20
         Root Delay: 0.000000, Root dispersion: 190.515075, 
Reference-ID: (unspec)
           Reference Timestamp:  0.000000000
           Originator Timestamp: 0.000000000
           Receive Timestamp:    0.000000000
           Transmit Timestamp:   3400431260.975470633 (2007/10/03 13:14:20)
             Originator - Receive Timestamp:  0.000000000
             Originator - Transmit Timestamp: 3400431260.975470633 
(2007/10/03 13:14:20)
         0x0000:  4500 004c 0000 4000 4011 d38a 80c4 e0dc  E..L.. at .@.......
         0x0010:  80c4 84b1 007b 007b 0038 6760 e300 0aec  .....{.{.8g`....
         0x0020:  0000 0000 00be 83dc 494e 4954 0000 0000  ........INIT....
         0x0030:  0000 0000 0000 0000 0000 0000 0000 0000  ................
         0x0040:  0000 0000 caae 769c f9b8 7184            ......v...q.
13:14:20.976650 IP (tos 0x0, ttl 252, id 48563, offset 0, flags [DF], 
proto: UDP (17), length: 76) news.ccit.arizona.edu.ntp > 
webmonkey.math.arizona.edu.ntp: [udp sum ok] NTPv4, length 48
         Server, Leap indicator:  (0), Stratum 3, poll 10s, precision -16
         Root Delay: 0.166061, Root dispersion: 0.018081, Reference-ID: 
64.254.225.246
           Reference Timestamp:  3400432569.483190536 (2007/10/03 13:36:09)
           Originator Timestamp: 3400431260.975470633 (2007/10/03 13:14:20)
           Receive Timestamp:    3400432606.929923057 (2007/10/03 13:36:46)
           Transmit Timestamp:   3400432606.930152893 (2007/10/03 13:36:46)
             Originator - Receive Timestamp:  +1345.954452395
             Originator - Transmit Timestamp: +1345.954682230
         0x0000:  4500 004c bdb3 4000 fc11 59d6 80c4 84b1  E..L.. at ...Y.....
         0x0010:  80c4 e0dc 007b 007b 0038 ee76 2403 0af0  .....{.{.8.v$...
         0x0020:  0000 2a83 0000 04a1 40fe e1f6 caae 7bb9  ..*..... at .....{.
         0x0030:  7bb2 6000 caae 769c f9b8 7184 caae 7bde  {.`...v...q...{.
         0x0040:  ee0f 7000 caae 7bde ee1e 8000            ..p...{.....
13:15:13.971854 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto: 
UDP (17), length: 76) webmonkey.math.arizona.edu.ntp > 
Hopey.Telcom.Arizona.EDU.ntp: [bad udp cksum b389!] NTPv4, length 48
         Client, Leap indicator: clock unsynchronized (192), Stratum 0, 
poll 10s, precision -20
         Root Delay: 0.000000, Root dispersion: 190.515884, 
Reference-ID: (unspec)
           Reference Timestamp:  0.000000000
           Originator Timestamp: 0.000000000
           Receive Timestamp:    0.000000000
           Transmit Timestamp:   3400431313.971838847 (2007/10/03 13:15:13)
             Originator - Receive Timestamp:  0.000000000
             Originator - Transmit Timestamp: 3400431313.971838847 
(2007/10/03 13:15:13)
         0x0000:  4500 004c 0000 4000 4011 d751 80c4 e0dc  E..L.. at .@..Q....
         0x0010:  80c4 80ea 007b 007b 0038 6399 e300 0aec  .....{.{.8c.....
         0x0020:  0000 0000 00be 8411 494e 4954 0000 0000  ........INIT....
         0x0030:  0000 0000 0000 0000 0000 0000 0000 0000  ................
         0x0040:  0000 0000 caae 76d1 f8ca 6e41            ......v...nA
13:15:13.972514 IP (tos 0x0, ttl  61, id 0, offset 0, flags [DF], proto: 
UDP (17), length: 76) Hopey.Telcom.Arizona.EDU.ntp > 
webmonkey.math.arizona.edu.ntp: [udp sum ok] NTPv4, length 48
         Server, Leap indicator:  (0), Stratum 2, poll 10s, precision -20
         Root Delay: 0.047073, Root dispersion: 0.052352, Reference-ID: 
usno.pa-x.dec.com
           Reference Timestamp:  3400432653.882398999 (2007/10/03 13:37:33)
           Originator Timestamp: 3400431313.971838847 (2007/10/03 13:15:13)
           Receive Timestamp:    3400432659.922692999 (2007/10/03 13:37:39)
           Transmit Timestamp:   3400432659.922714999 (2007/10/03 13:37:39)
             Originator - Receive Timestamp:  +1345.950854182
             Originator - Transmit Timestamp: +1345.950876176
         0x0000:  4500 004c 0000 4000 3d11 da51 80c4 80ea  E..L.. at .=..Q....
         0x0010:  80c4 e0dc 007b 007b 0038 bdb2 2402 0aec  .....{.{.8..$...
         0x0020:  0000 0c0d 0000 0d67 cc7b 0248 caae 7c0d  .......g.{.H..|.
         0x0030:  e1e4 e69f caae 76d1 f8ca 6e41 caae 7c13  ......v...nA..|.
         0x0040:  ec35 9bc3 caae 7c13 ec37 0cdc            .5....|..7..



I remain clueless as to why our ntpd gets into this wacko state and 
ignores these response packets? Can someone spot the problem?



 > Specifying the same external servers is bad, because it means there
 > is no diversity and also puts an unnecessary load on the servers.

Given the additional info at the top of this post, does this comment 
still apply to us? I'd love to understand this stuff better and want to 
be sure we're doing the right thing. I figured that by replacing
    server pool.ntp.org
with
    server ntp.arizona.edu
    server ntp1.arizona.edu
    server ntp2.arizona.edu
    server ntp3.arizona.edu
we'd take load off the world-wide pool and instead hit the four 
on-campus stratum 2 servers, which are provided by the campus-wide 
computing group expressly for use by our campus.


Thanks.
Alexander Perlis
Computer Support, Dept of Mathematics, University of Arizona




More information about the questions mailing list