[ntp:questions] ntpd sometimes goes wacko, all peers in INIT, packets mostly ignored, but restart fixes it

Alexander Perlis aprl at math.arizona.edu
Mon Oct 1 17:13:49 UTC 2007


Greetings. Can someone tell me what to try next to figure out the 
following problem? We're stumped.

All our servers have identical ntpd configurations. Sometimes one of 
them (not always the same one) doesn't track the time. But if we restart 
ntpd, it works just fine again. I have no idea how to reproduce the 
problem directly --- I just have to wait until one of the servers shows 
incorrect time.

At this moment I have a server in the broken state. It has:

   'ntptime' returns code 5 (ERROR)

   'ntpq -c pe' says each peer is in state ".INIT."

   'ntpdc -c iostats' says:
        dropped packets: 0
        ignored packets: 84832
        received packets: 57
        packets sent: 48129
        packets not sent: 0

   If I restart ntp, which makes everything start working, then iostats 
tells me that ignored packet count is 0. In other words, most of our 
servers have ignored packets at 0. The fact that the wacko ntpd has a 
large ignored packet count hints that this must be where the problem 
lies. Why does ntpd sometimes go into a weird state and start ignoring 
the majority of packets?

   I turned on 'tcpdump -X -s 0 dst port 123' and can confirm that the 
wacko ntpd does periodically send UDP packets to the peers and that the 
peers immediately respond. Which makes sense, since the ignored packets 
count reported by iostats is indeed increasing over time.

   Furthermore, the output of 'netstat -avp' confirms that ntpd is 
listening on port 123. Well, that also makes sense.

   There are no errors from ntpd in the syslog.


   An strace of the broken ntpd shows the following repeatedly:

select(20, [16 17 18 19], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be 
restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
select(20, [16 17 18 19], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be 
restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])

   But an strace of ntpd on one of our other servers, not in the broken 
state, instead shows

select(22, [16 17 18 19 20 21], NULL, NULL, NULL) = ? ERESTARTNOHAND (To 
be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
select(22, [16 17 18 19 20 21], NULL, NULL, NULL) = ? ERESTARTNOHAND (To 
be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])

   Notice the difference! Not sure what that means. But maybe this is 
just a consequence of the broken ntpd still having all peers in INIT.


   An strace of the broken ntpd when it contacts a peer:

gettimeofday({1191255988, 978236}, NULL) = 0
sendto(16, "\343\0\n\354\0\0\0\0\0\273\267\317INIT\0\0\0\0\0\0\0\0"..., 
48, 0, {sa_family=AF_INET, sin_port=htons(123), 
sin_addr=inet_addr("128.196.132.177")}, 16) = 48
select(20, [16 17 18 19], NULL, NULL, NULL) = 1 (in [16])
gettimeofday({1191255988, 979721}, NULL) = 0
select(20, [16 17 18 19], NULL, NULL, {0, 0}) = 1 (in [16], left {0, 0})
recvfrom(16, 
"$\3\n\360\0\0.\365\0\0\5\3@\376\341\366\312\253\243WS\366"..., 1000, 0, 
{sa_family=AF_INET, sin_port=htons(123), 
sin_addr=inet_addr("128.196.132.177")}, [16]) = 48
recvfrom(16, 0xbff22a6c, 1000, 0, 0xbff22e54, 0xbff232c4) = -1 EAGAIN 
(Resource temporarily unavailable)

   Corresponding strace on a working ntpd:

gettimeofday({1191258202, 717415}, NULL) = 0
sendto(21, 
"#\3\n\354\0\0\0\337\0\0\17\327\200\304x%\312\253\243\365"..., 48, 0, 
{sa_family=AF_INET, sin_port=htons(123), 
sin_addr=inet_addr("128.196.132.177")}, 16) = 48
select(22, [16 17 18 19 20 21], NULL, NULL, NULL) = 1 (in [21])
gettimeofday({1191258202, 755184}, NULL) = 0
select(22, [16 17 18 19 20 21], NULL, NULL, {0, 0}) = 1 (in [21], left 
{0, 0})
recvfrom(21, 
"$\3\n\360\0\0\36T\0\0\5\327\200\304\205\351\312\253\246"..., 1092, 0, 
{sa_family=AF_INET, sin_port=htons(123), 
sin_addr=inet_addr("128.196.132.177")}, [16]) = 48
recvfrom(21, 0x812e9f4, 1092, 0, 0x812e8d8, 0xbfff0c14) = -1 EAGAIN 
(Resource temporarily unavailable)
gettimeofday({1191258202, 755875}, NULL) = 0
gettimeofday({1191258202, 756011}, NULL) = 0
time(NULL)                              = 1191258202
write(4, "54374 61402.756 128.196.132.177 "..., 85) = 85


   On the broken ntpd, the directory /var/log/ntpstats/ is empty, 
whereas on the working ntpd there are files there. Furthermore, "lsof" 
shows that the broken ntpd does not have any open descriptors for files 
in /var/log/ntpstats/. But perhaps ntpd doesn't open those descriptors 
until after a peer leaves the INIT state?


   Summary: the peer responses are making it to ntpd, but ntpd is 
ignoring them for some reason. How to figure out *why* the packets are 
being ignored? (Remember, this can't be something obviously wrong in our 
config file, which is the same across all servers. If we simply 
quit/restart the broken ntpd, it works fine.)

   What should we try next? Are there other ways to poke it with ntpq or 
ntpdc, or other ways to figure out why it is ignoring the responses from 
the peers?


This is on a Debian 4.0 stable, running ntpd 4.2.2p4 at 1.1585-o. Your help 
is appreciated!

Alexander Perlis
Department of Mathematics
The University of Arizona




More information about the questions mailing list