[ntp:questions] ntpd sometimes goes wacko, all peers in INIT, packets mostly ignored, but restart fixes it
Alexander Perlis
aprl at math.arizona.edu
Mon Oct 1 17:13:49 UTC 2007
Greetings. Can someone tell me what to try next to figure out the
following problem? We're stumped.
All our servers have identical ntpd configurations. Sometimes one of
them (not always the same one) doesn't track the time. But if we restart
ntpd, it works just fine again. I have no idea how to reproduce the
problem directly --- I just have to wait until one of the servers shows
incorrect time.
At this moment I have a server in the broken state. It has:
'ntptime' returns code 5 (ERROR)
'ntpq -c pe' says each peer is in state ".INIT."
'ntpdc -c iostats' says:
dropped packets: 0
ignored packets: 84832
received packets: 57
packets sent: 48129
packets not sent: 0
If I restart ntp, which makes everything start working, then iostats
tells me that ignored packet count is 0. In other words, most of our
servers have ignored packets at 0. The fact that the wacko ntpd has a
large ignored packet count hints that this must be where the problem
lies. Why does ntpd sometimes go into a weird state and start ignoring
the majority of packets?
I turned on 'tcpdump -X -s 0 dst port 123' and can confirm that the
wacko ntpd does periodically send UDP packets to the peers and that the
peers immediately respond. Which makes sense, since the ignored packets
count reported by iostats is indeed increasing over time.
Furthermore, the output of 'netstat -avp' confirms that ntpd is
listening on port 123. Well, that also makes sense.
There are no errors from ntpd in the syslog.
An strace of the broken ntpd shows the following repeatedly:
select(20, [16 17 18 19], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be
restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
select(20, [16 17 18 19], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be
restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
But an strace of ntpd on one of our other servers, not in the broken
state, instead shows
select(22, [16 17 18 19 20 21], NULL, NULL, NULL) = ? ERESTARTNOHAND (To
be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
select(22, [16 17 18 19 20 21], NULL, NULL, NULL) = ? ERESTARTNOHAND (To
be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
Notice the difference! Not sure what that means. But maybe this is
just a consequence of the broken ntpd still having all peers in INIT.
An strace of the broken ntpd when it contacts a peer:
gettimeofday({1191255988, 978236}, NULL) = 0
sendto(16, "\343\0\n\354\0\0\0\0\0\273\267\317INIT\0\0\0\0\0\0\0\0"...,
48, 0, {sa_family=AF_INET, sin_port=htons(123),
sin_addr=inet_addr("128.196.132.177")}, 16) = 48
select(20, [16 17 18 19], NULL, NULL, NULL) = 1 (in [16])
gettimeofday({1191255988, 979721}, NULL) = 0
select(20, [16 17 18 19], NULL, NULL, {0, 0}) = 1 (in [16], left {0, 0})
recvfrom(16,
"$\3\n\360\0\0.\365\0\0\5\3@\376\341\366\312\253\243WS\366"..., 1000, 0,
{sa_family=AF_INET, sin_port=htons(123),
sin_addr=inet_addr("128.196.132.177")}, [16]) = 48
recvfrom(16, 0xbff22a6c, 1000, 0, 0xbff22e54, 0xbff232c4) = -1 EAGAIN
(Resource temporarily unavailable)
Corresponding strace on a working ntpd:
gettimeofday({1191258202, 717415}, NULL) = 0
sendto(21,
"#\3\n\354\0\0\0\337\0\0\17\327\200\304x%\312\253\243\365"..., 48, 0,
{sa_family=AF_INET, sin_port=htons(123),
sin_addr=inet_addr("128.196.132.177")}, 16) = 48
select(22, [16 17 18 19 20 21], NULL, NULL, NULL) = 1 (in [21])
gettimeofday({1191258202, 755184}, NULL) = 0
select(22, [16 17 18 19 20 21], NULL, NULL, {0, 0}) = 1 (in [21], left
{0, 0})
recvfrom(21,
"$\3\n\360\0\0\36T\0\0\5\327\200\304\205\351\312\253\246"..., 1092, 0,
{sa_family=AF_INET, sin_port=htons(123),
sin_addr=inet_addr("128.196.132.177")}, [16]) = 48
recvfrom(21, 0x812e9f4, 1092, 0, 0x812e8d8, 0xbfff0c14) = -1 EAGAIN
(Resource temporarily unavailable)
gettimeofday({1191258202, 755875}, NULL) = 0
gettimeofday({1191258202, 756011}, NULL) = 0
time(NULL) = 1191258202
write(4, "54374 61402.756 128.196.132.177 "..., 85) = 85
On the broken ntpd, the directory /var/log/ntpstats/ is empty,
whereas on the working ntpd there are files there. Furthermore, "lsof"
shows that the broken ntpd does not have any open descriptors for files
in /var/log/ntpstats/. But perhaps ntpd doesn't open those descriptors
until after a peer leaves the INIT state?
Summary: the peer responses are making it to ntpd, but ntpd is
ignoring them for some reason. How to figure out *why* the packets are
being ignored? (Remember, this can't be something obviously wrong in our
config file, which is the same across all servers. If we simply
quit/restart the broken ntpd, it works fine.)
What should we try next? Are there other ways to poke it with ntpq or
ntpdc, or other ways to figure out why it is ignoring the responses from
the peers?
This is on a Debian 4.0 stable, running ntpd 4.2.2p4 at 1.1585-o. Your help
is appreciated!
Alexander Perlis
Department of Mathematics
The University of Arizona
More information about the questions
mailing list