[ntp:questions] Odd behaviour in a multicast setting (bug in ntpd???)

Marco Marongiu brontolinux at gmail.com
Fri Aug 26 14:10:21 UTC 2011


Hello everybody

We have a problem in one of our datacenters. The short version is: we
are using only multicast clients and multicast servers, but all clients
behave like unicast clients toward one of them. A month ago, restarting
ntpd didn't change this odd behaviour; now it does.

Either this is some weird misconfiguration on my side, or it is a bug in
ntpd (though I can't see how such a bug would get triggered). If you
think you can help, please consider reading the details below (long).
The configuration file for clients is also included.

Thanks in advance to anyone who can help

Ciao
-- bronto

-----------------------------------------------------------------------

We have four multicast NTP servers, feeding more than 600 clients; each
server uses a different symmetric key, and all four keys are trusted on
the clients.
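
For reference, a server-side configuration along these lines might look
as follows. This is a sketch only: the servers' actual files aren't
shown in this message, and the key ID and TTL are assumptions.

```
# /etc/ntp.conf on a server -- hypothetical sketch, not the actual file.
# ntpd treats a multicast group on a "broadcast" line as multicast;
# each server would sign with its own key (key 1 here, 2-4 on the others).
keys /etc/ntp/ntp.keys
trustedkey 1
broadcast 224.0.1.1 key 1 ttl 7
```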

All machines run Debian Linux: Squeeze on the servers and Lenny on the
clients. All are configured with two bonded interfaces, so that packets
are sent by the servers and received by the clients via the bond0
interface.

All machines were installed using FAI.

Clients are pure multicast NTP: they don't poll other servers via
unicast, nor do they peer with other machines or listen for broadcasts.

ntpd versions involved are 4.2.6p2 on servers, and 4.2.4p4 on clients:
these are the versions bundled into the respective Debian distributions.

All clients are working well except for one thing. One of the servers
(always the same one, and always the first in the list when you run
ntpq's peers command) appears with a poll interval of 1024 seconds.
We'll call this server s51 from now on.

I first noticed this behaviour in July. I am pretty sure I tried
restarting ntpd on one of the clients, but it didn't help: after a
while, s51 appeared with a 1024 s poll interval again, as if it were
configured as a unicast server. Since everything was working well
enough, I decided to postpone a full investigation until after my
vacation.

After about a month, I checked on s51 with tcpdump to be sure it was
sending the packets at the right interval, and it was: each packet got
sent about 63-68 seconds after the previous one.
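
For what it's worth, that spacing can be computed mechanically from the
capture timestamps; a minimal sketch, with made-up sample timestamps
standing in for real tcpdump output:

```shell
# Compute the gap (in seconds) between consecutive HH:MM:SS.frac
# timestamps. The three sample timestamps below are invented.
printf '%s\n' 10:00:00.000100 10:01:04.000150 10:02:09.000120 |
awk -F: '{
    t = $1 * 3600 + $2 * 60 + $3      # seconds since midnight
    if (NR > 1) printf "%.0f\n", t - prev
    prev = t
}'
# prints 64, then 65
```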

Then I went to the clients and, again with tcpdump, saw that the packets
were coming in at the right time. The question then was: are they really
caught by ntpd?

I did a first investigation with strace. The involved file descriptors
were 16 (associated with *:ntp), 22 (bond0's address:ntp) and 23
(ntp.mcast.net:ntp), all in IPv4. Apparently, the packets were coming in
and were correctly handled by ntpd.

That was all strange, so I decided to switch peerstats on to be really
sure that ntpd was actually handling them. And there the strange thing
happened: as soon as I changed the configuration and restarted ntpd, the
client started working correctly. Thinking the configuration change
itself might have triggered the change in behaviour, I rolled my changes
back, literally copying the previous file into place. I restarted ntpd,
and it was still working correctly.
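
For the record, turning peerstats on amounts to a few lines in ntp.conf.
A hypothetical fragment (the statsdir path is the Debian convention,
assumed here rather than taken from the actual change):

```
# Record one line per received peer packet, rotated daily.
statsdir /var/log/ntpstats/
statistics peerstats
filegen peerstats file peerstats type day enable
```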

I went to another misbehaving client and restarted ntpd without making
any change. ntpd worked well from that moment on. The same happened on a
few other clients, too.

Now I had some good clients and a lot of bad ones. And they were
multicast clients, so they were receiving packets from the servers at
the same time. I decided to let strace run again (with simple timestamps
enabled) on two of them, a good one and a bad one, and compare the traces.

A first diff showed some expected differences (e.g. calls to adjtimex()
happened at different times on the two machines, and recvmsg/recvfrom
parameters differed slightly), but also some oddities, like a number of
sendto calls on FD 23 on the bad client, sending packets to s51.

Filtering on the recv* calls gave some other interesting results: for
each of the two clients I printed out the timestamp, the call, and the
file descriptor it was made on, then diff-ed again. This showed that the
bad client was executing some recvmsg calls that its good counterpart
was not:

$ diff *recv-calls
168a169,170
> 11:47:53 recvmsg 22
> 11:47:53 recvmsg 22
416a419,420
> 12:04:56 recvmsg 22
> 12:04:56 recvmsg 22
544a549,550
> 12:13:27 recvmsg 22
> 12:13:27 recvmsg 22
668a675,676
> 12:21:58 recvmsg 22
> 12:21:58 recvmsg 22
920a929,930
> 12:39:04 recvmsg 22
> 12:39:04 recvmsg 22
1172a1183,1184
> 12:56:08 recvmsg 22
> 12:56:08 recvmsg 22
1416a1429,1430
> 13:13:13 recvmsg 22
> 13:13:13 recvmsg 22

You may notice that the extra calls are spaced either ~512 or ~1024
seconds apart. All of them looked like the following (some info hidden
with [CUT]):

> 11:47:53 recvmsg(22, {msg_name(16)={sa_family=AF_INET, sin_port=htons(123), sin_addr=inet_addr("[CUT]")}, msg_iov(1)=[{"[CUT]"..., 1092}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 68
> 11:47:53 recvmsg(22, 0x[CUT], 0) = -1 EAGAIN (Resource temporarily unavailable)
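
A possible way to reduce such a timestamped strace log to the
time/call/fd triples used above (the two sample lines are invented, and
the real log format may differ slightly):

```shell
# Strip each network-call line down to "time syscall fd".
printf '%s\n' \
  '11:47:53 recvmsg(22, {msg_name(16)=...}, 0) = 68' \
  '11:47:54 sendto(23, "...", 48, 0, ...) = 48' |
sed -E 's/^([0-9:]+) ([a-z]+)\(([0-9]+).*/\1 \2 \3/'
# prints:
# 11:47:53 recvmsg 22
# 11:47:54 sendto 23
```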


Finally, a tcpdump running since yesterday showed another strange
behaviour: the bad client is actually acting like a unicast client
toward s51. Yesterday, tcpdump showed the client polling regularly at
~1024 s intervals, while ntpq still reported it as a multicast client:

>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> -s51             69.25.96.13      2 m   97 1024  377    0.292    6.520   0.074
> +s11             132.163.4.101    2 m   34   64  376    0.182    2.249   0.120
> +s12             132.163.4.102    2 m   26   64  377    0.001    0.692   0.030
> *s52             131.107.13.100   2 m   40   64  377    0.001   -2.613   0.096

> 13:13:13.745092 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 13:30:18.745117 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 13:47:23.745120 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 14:04:27.744838 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 14:21:33.744808 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68

Then suddenly something happens, and the polling frequency increases to
one poll every 64 seconds:

> 15:12:47.744825 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 15:21:21.744771 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 15:23:30.744865 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 15:24:34.744764 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 15:25:40.744775 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68
> 15:26:45.744849 IP badclient.ntp > s51.ntp: NTPv4, Client, length 68

and nothing relevant is reported in the logs (only "synchronized to"
messages).

The poll interval appears to have increased and decreased over the last
24 hours; it was back to 64 seconds a few minutes ago and is now
increasing again. I am quite surprised to see that ntpq now reports this
client as unicast to s51:

>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> -s51             69.25.96.13      2 u    6  128  377    0.228    7.140   0.123
> +s11             132.163.4.101    2 m   41   64  376    0.182   -2.080   0.048
> +s12             132.163.4.102    2 m   50   64  376    0.001    1.441   0.045
> *s52             204.9.54.119     2 m   64   64  376    0.001    1.068   0.058
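
With 600+ clients, spotting such flips by eye is painful; one could grep
ntpq's t column instead. A sketch on invented sample lines (column
positions assumed to match ntpq -pn billboard output):

```shell
# Flag any association whose type column (field 4) is not 'm' (multicast).
printf '%s\n' \
  '-s51             69.25.96.13      2 u    6  128  377    0.228    7.140   0.123' \
  '+s11             132.163.4.101    2 m   41   64  376    0.182   -2.080   0.048' |
awk '$4 != "m" { print $1, "is type", $4 }'
# prints: -s51 is type u
```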

...and all this leaves me clueless. I really don't understand what is
going on here. Suggestions?

Thanks a lot
-- Marco


-----------------------------------------------------------------------

# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help
#
# NTP mcast client configuration

driftfile /var/lib/ntp/ntp.drift


keys /etc/ntp/ntp.keys
trustedkey 1 2 3 4
multicastclient 224.0.1.1

# Access control configuration; see /usr/share/doc/ntp-doc/html/accopt.html
# for details.  The web page
# <http://support.ntp.org/bin/view/Support/AccessRestrictions>
# might also be helpful.
#
# Note that "restrict" applies to both servers and clients, so a configuration
# that might be intended to block requests from certain clients could also end
# up blocking replies from your own upstream servers.

# By default, exchange time with everybody, but don't allow configuration.

restrict -4 default kod notrap nomodify nopeer noquery notrust
restrict -6 default kod notrap nomodify nopeer noquery notrust

# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1


