[ntp:questions] ntpd sometimes goes wacko, all peers in INIT, packets mostly ignored, but restart fixes it
Alexander Perlis
aprl at math.arizona.edu
Wed Oct 3 20:39:24 UTC 2007
> You haven't really provide any hard information, but a wild guess
> would be misuse of local clock in an isolated time island.
Thank you for trying to help us. My apologies as a newbie as far as
investigating ntp trouble is concerned. This present post contains the
additional information you request. Your assistance and possible insight
is much appreciated.
To recall (and I hope clarify), we have a bunch of ntp clients with an
identical list of ntp servers. (My prior post was misleading in that I
referred to our machines as "servers". They are our webserver,
mailserver, etc. They are *not* time servers. So as far as ntp is
concerned, we have a bunch of identically-configured ntp *clients*.)
Sometimes, seemingly at random, we discover some of these boxes to have
incorrect time-of-day, and investigating ntpd on those boxes shows it to
be in a seemingly wacko state in which all peers show ".INIT.". Although
packets are occasionally sent and responses received, they are ignored.
If we do a "/etc/init.d/ntp restart" then everything is fine.
> To make a start, we need the contents of the ntp.conf file
Here it is. We started with the default Debian ntp.conf, but since our
university provides four on-campus ntp servers, we replaced the line
server pool.ntp.org
with four lines corresponding to our university time servers:
# /etc/ntp.conf
driftfile /var/lib/ntp/ntp.drift
statsdir /var/log/ntpstats/
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
### This is the part we changed:
server ntp.arizona.edu
server ntp1.arizona.edu
server ntp2.arizona.edu
server ntp3.arizona.edu
###
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery
restrict 127.0.0.1
restrict ::1
# end-of-file
> and the result of the ntpq peers command.
On a wacko ntpd:
ntpq> peers
remote refid st t when poll reach delay offset
jitter
==============================================================================
Hopey.Telcom.Ar .INIT. 16 - - 1024 0 0.000 0.000
0.000
palm.CCIT.Arizo .INIT. 16 - - 1024 0 0.000 0.000
0.000
news.ccit.arizo .INIT. 16 - - 1024 0 0.000 0.000
0.000
grumpy.CCIT.Ari .INIT. 16 - - 1024 0 0.000 0.000
0.000
On a non-wacko ntpd:
ntpq> peers
remote refid st t when poll reach delay offset
jitter
==============================================================================
-Hopey.Telcom.Ar 204.123.2.72 2 u 277 1024 377 0.501 -3.381
0.892
+palm.CCIT.Arizo 192.43.244.18 2 u 845 1024 377 0.692 0.483
0.283
+news.ccit.arizo 128.196.133.233 3 u 67 1024 377 1.126 1.039
3.293
*grumpy.CCIT.Ari 128.196.120.19 2 u 318 1024 377 6.340 4.147
4.295
> As you seem to be rejecting responses, the output of ntpq's rv
> command for each server association is also likely to be useful,
> as it should contain the reject reason.
ntpq> associations
ind assID status conf reach auth condition last_event cnt
===========================================================
1 39197 8000 yes yes none reject
2 39198 8000 yes yes none reject
3 39199 8000 yes yes none reject
4 39200 8000 yes yes none reject
ntpq> rv 39197
assID=39197 status=8000 unreach, conf, no events,
srcadr=Hopey.Telcom.Arizona.EDU, srcport=123, dstadr=0.0.0.0,
dstport=123, leap=11, stratum=16, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=INIT, reach=000, unreach=12420, hmode=3,
pmode=0, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
reftime=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
org=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
rec=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
xmt=caae6ece.f93ad5aa Wed, Oct 3 2007 12:41:02.973,
filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtoffset= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtdisp= 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
ntpq> rv 39198
assID=39198 status=8000 unreach, conf, no events,
srcadr=palm.CCIT.Arizona.EDU, srcport=123, dstadr=0.0.0.0, dstport=123,
leap=11, stratum=16, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=INIT, reach=000, unreach=12420, hmode=3,
pmode=0, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
reftime=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
org=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
rec=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
xmt=caae6f45.f935c233 Wed, Oct 3 2007 12:43:01.973,
filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtoffset= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtdisp= 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
ntpq> rv 39199
assID=39199 status=8000 unreach, conf, no events,
srcadr=news.ccit.arizona.edu, srcport=123, dstadr=0.0.0.0, dstport=123,
leap=11, stratum=16, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=INIT, reach=000, unreach=12420, hmode=3,
pmode=0, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
reftime=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
org=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
rec=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
xmt=caae6e9a.f9a35ab7 Wed, Oct 3 2007 12:40:10.975,
filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtoffset= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtdisp= 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
ntpq> rv 39200
assID=39200 status=8000 unreach, conf, no events,
srcadr=grumpy.CCIT.Arizona.EDU, srcport=123, dstadr=0.0.0.0,
dstport=123, leap=11, stratum=16, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=INIT, reach=000, unreach=12420, hmode=3,
pmode=0, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
reftime=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
org=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
rec=00000000.00000000 Wed, Feb 6 2036 23:28:16.000,
xmt=caae6f2c.f9cb711a Wed, Oct 3 2007 12:42:36.975,
filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtoffset= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtdisp= 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
What's perhaps strange here is that it says "unreach" and has
"dstadr=0.0.0.0". The non-wacko ntpds have "reach" and
"dstadr=<IP_Address_of_Server>". Yet the wacko ntpds *are* communicating
with the servers (see below).
> The tcpdump trace might also be helpful.
This tcpdump on a machine with a wacko ntpd seems to indicate (to me
at least) that ntpd is indeed getting packets to the peers and the peers
are indeed responding. (This corroborates the fact that the ignored
packet count inside ntpd is increasing, as seen from ntpdc iostats.)
webmonkey:~# tcpdump -X -s 0 -vvv dst port 123
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size
65535 bytes
13:14:20.975507 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto:
UDP (17), length: 76) webmonkey.math.arizona.edu.ntp >
news.ccit.arizona.edu.ntp: [bad udp cksum 5e7e!] NTPv4, length 48
Client, Leap indicator: clock unsynchronized (192), Stratum 0,
poll 10s, precision -20
Root Delay: 0.000000, Root dispersion: 190.515075,
Reference-ID: (unspec)
Reference Timestamp: 0.000000000
Originator Timestamp: 0.000000000
Receive Timestamp: 0.000000000
Transmit Timestamp: 3400431260.975470633 (2007/10/03 13:14:20)
Originator - Receive Timestamp: 0.000000000
Originator - Transmit Timestamp: 3400431260.975470633
(2007/10/03 13:14:20)
0x0000: 4500 004c 0000 4000 4011 d38a 80c4 e0dc E..L.. at .@.......
0x0010: 80c4 84b1 007b 007b 0038 6760 e300 0aec .....{.{.8g`....
0x0020: 0000 0000 00be 83dc 494e 4954 0000 0000 ........INIT....
0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0040: 0000 0000 caae 769c f9b8 7184 ......v...q.
13:14:20.976650 IP (tos 0x0, ttl 252, id 48563, offset 0, flags [DF],
proto: UDP (17), length: 76) news.ccit.arizona.edu.ntp >
webmonkey.math.arizona.edu.ntp: [udp sum ok] NTPv4, length 48
Server, Leap indicator: (0), Stratum 3, poll 10s, precision -16
Root Delay: 0.166061, Root dispersion: 0.018081, Reference-ID:
64.254.225.246
Reference Timestamp: 3400432569.483190536 (2007/10/03 13:36:09)
Originator Timestamp: 3400431260.975470633 (2007/10/03 13:14:20)
Receive Timestamp: 3400432606.929923057 (2007/10/03 13:36:46)
Transmit Timestamp: 3400432606.930152893 (2007/10/03 13:36:46)
Originator - Receive Timestamp: +1345.954452395
Originator - Transmit Timestamp: +1345.954682230
0x0000: 4500 004c bdb3 4000 fc11 59d6 80c4 84b1 E..L.. at ...Y.....
0x0010: 80c4 e0dc 007b 007b 0038 ee76 2403 0af0 .....{.{.8.v$...
0x0020: 0000 2a83 0000 04a1 40fe e1f6 caae 7bb9 ..*..... at .....{.
0x0030: 7bb2 6000 caae 769c f9b8 7184 caae 7bde {.`...v...q...{.
0x0040: ee0f 7000 caae 7bde ee1e 8000 ..p...{.....
13:15:13.971854 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto:
UDP (17), length: 76) webmonkey.math.arizona.edu.ntp >
Hopey.Telcom.Arizona.EDU.ntp: [bad udp cksum b389!] NTPv4, length 48
Client, Leap indicator: clock unsynchronized (192), Stratum 0,
poll 10s, precision -20
Root Delay: 0.000000, Root dispersion: 190.515884,
Reference-ID: (unspec)
Reference Timestamp: 0.000000000
Originator Timestamp: 0.000000000
Receive Timestamp: 0.000000000
Transmit Timestamp: 3400431313.971838847 (2007/10/03 13:15:13)
Originator - Receive Timestamp: 0.000000000
Originator - Transmit Timestamp: 3400431313.971838847
(2007/10/03 13:15:13)
0x0000: 4500 004c 0000 4000 4011 d751 80c4 e0dc E..L.. at .@..Q....
0x0010: 80c4 80ea 007b 007b 0038 6399 e300 0aec .....{.{.8c.....
0x0020: 0000 0000 00be 8411 494e 4954 0000 0000 ........INIT....
0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0040: 0000 0000 caae 76d1 f8ca 6e41 ......v...nA
13:15:13.972514 IP (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto:
UDP (17), length: 76) Hopey.Telcom.Arizona.EDU.ntp >
webmonkey.math.arizona.edu.ntp: [udp sum ok] NTPv4, length 48
Server, Leap indicator: (0), Stratum 2, poll 10s, precision -20
Root Delay: 0.047073, Root dispersion: 0.052352, Reference-ID:
usno.pa-x.dec.com
Reference Timestamp: 3400432653.882398999 (2007/10/03 13:37:33)
Originator Timestamp: 3400431313.971838847 (2007/10/03 13:15:13)
Receive Timestamp: 3400432659.922692999 (2007/10/03 13:37:39)
Transmit Timestamp: 3400432659.922714999 (2007/10/03 13:37:39)
Originator - Receive Timestamp: +1345.950854182
Originator - Transmit Timestamp: +1345.950876176
0x0000: 4500 004c 0000 4000 3d11 da51 80c4 80ea E..L.. at .=..Q....
0x0010: 80c4 e0dc 007b 007b 0038 bdb2 2402 0aec .....{.{.8..$...
0x0020: 0000 0c0d 0000 0d67 cc7b 0248 caae 7c0d .......g.{.H..|.
0x0030: e1e4 e69f caae 76d1 f8ca 6e41 caae 7c13 ......v...nA..|.
0x0040: ec35 9bc3 caae 7c13 ec37 0cdc .5....|..7..
I remain clueless as to why our ntpd gets into this wacko state and
ignores these response packets? Can someone spot the problem?
> Specifying the same external servers is bad, because it means there
> is no diversity and also puts an unnecessary load on the servers.
Given the additional info at the top of this post, does this comment
still apply to us? I'd love to understand this stuff better and want to
be sure we're doing the right thing. I figured that by replacing
server pool.ntp.org
with
server ntp.arizona.edu
server ntp1.arizona.edu
server ntp2.arizona.edu
server ntp3.arizona.edu
we'd take load off the world-wide pool and instead hit the four
on-campus stratum 2 servers, which are provided by the campus-wide
computing group expressly for use by our campus.
Thanks.
Alexander Perlis
Computer Support, Dept of Mathematics, University of Arizona
More information about the questions
mailing list