[ntp:questions] Ntpd in uninterruptible sleep?

Dave Hart hart at ntp.org
Sat Nov 5 01:40:15 UTC 2011


On Fri, Nov 4, 2011 at 21:52, A C <agcarver+ntp at acarver.net> wrote:
> Ok, so ntpd does not respond to a SIGQUIT with a core dump but I did manage
> to attach a trace to the process before killing it.  The output of ktrace is
> below.  The circumstances of this particular lockup I was actually able to
> observe.  I camped out at the system waiting.
>
> A cron job fired off which does routine disk maintenance (diffs of config
> files, free space calculations, postfix cleanup, etc.)  In this case things
> got swapped around a bit while all the cleanup was occurring.  However, after
> those activities finished, ntpd never returned to normal.  It just spun out
> of control resulting in the trace below. Other programs that were running at
> the same time (gpsd, xclock, xterm) all recovered cleanly though the system
> was now bogged down by ntpd consuming almost all the processor time even
> though it was not set to high priority.  The capture below pretty much loops
> continuously until ntpd is finally killed.  I actually let the system run
> for an additional 24 hours in this state just to see if it would bounce back
> but it never did.  I killed only one process, ntpd, and everything else was
> fine as the CPU load dropped to near zero immediately.
>
> 1210      1 ntpd     CALL  clock_gettime(0,0xefffd0e8)
> 1210      1 ntpd     RET   clock_gettime 0, -268447512/0xefffd0e8
> 1210      1 ntpd     CALL  select(0x1c,0xefffd05c,0,0,0xefffd0b4)
> 1210      1 ntpd     RET   select 1, -268447652/0xefffd05c
> 1210      1 ntpd     CALL  recvfrom(0x16,0xefffcc74,0x3e8,0,0xefffd098,0xefffd0ec)
> 1210      1 ntpd     MISC  msghdr: 28, 00000000f02cf7e0f25edeac00000001000000000001b58400000000
> 1210      1 ntpd     GIO   fd 22 read 12 bytes
>     "\^V\^A\0\^A\0\0\0\0\0\0\0\0"
> 1210      1 ntpd     MISC  sockname: 16, 1002de040a00008d0000000000000000
> 1210      1 ntpd     RET   recvfrom 12/0xc, -268448652/0xefffcc74

From your netbsd.org mailing list traffic, I believe you're using
NetBSD 5.x.  Looking at ntpd/ntp_io.c, recvfrom() is not the call I'd
expect to see happen, as NetBSD 5.x supports SO_TIMESTAMP, so #ifdef
HAVE_TIMESTAMP code is active, and ntpd would typically use recvmsg()
rather than recvfrom().  See read_network_packet() in ntpd/ntp_io.c.  I
say typically because if either

1.  the particular local address ("interface") to which the socket is
bound is ignoring input (as ntpd's wildcard sockets do, and others can
be configured to do via "interface ___ drop" in ntp.conf), or
2.  ntpd has no receive buffers available

then ntpd will use recvfrom() to a stack-based buffer (0xefffcc74
here) and discard the data so read.  My hunch is ntpd is somehow
getting wedged during your cron jobs so that all receive buffers are
consumed and more cannot be allocated.  You can monitor the situation
using ntpq -c iostats on 4.2.7, or ntpdc -c iostats on earlier
versions.  Pay particular attention to free receive buffers and
dropped packets (due to no buffer).
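
To make those two paths concrete, here's a rough sketch of the shape
of that logic in read_network_packet().  The identifiers and the exact
signature below are from memory and only illustrative, not a quote of
ntp_io.c, but the decision is the one just described:

static void
read_network_packet(SOCKET fd, struct interface *itf)
{
	char			discard_buf[1000];	/* throwaway stack space */
	struct sockaddr_storage	from;
	socklen_t		fromlen = sizeof(from);
	recvbuf_t *		rb;

	/*
	 * If this local address is ignoring input, or no free receive
	 * buffer can be had, drain the socket into the stack buffer
	 * and drop the data on the floor.
	 */
	rb = itf->ignore_packets ? NULL : get_free_recv_buffer();
	if (NULL == rb) {
		recvfrom(fd, discard_buf, sizeof(discard_buf), 0,
			 (struct sockaddr *)&from, &fromlen);
		return;
	}

	/*
	 * Normal path: recvmsg() into rb, so the SO_TIMESTAMP control
	 * message is captured along with the packet.
	 */
	/* ... */
}

The recvfrom() into 0xefffcc74 with a length of 0x3e8 (1000 bytes) in
your ktrace looks exactly like that discard path.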

ntpd can't allocate more receive buffers safely while handling SIGIO.
That is done later, after the signal handler has returned, as a side
effect of pulling a "full" receive buffer from a list of previously
received packets for processing, if a packet had previously been
dropped due to lack of receive buffers.  To check whether you've found
a corner case where that allocation code never gets called, I suggest
you try changing this code in libntp/recvbuff.c from:

isc_boolean_t has_full_recv_buffer(void)
{
	if (HEAD_FIFO(full_recv_fifo) != NULL)
		return (ISC_TRUE);
	else
		return (ISC_FALSE);
}

to

isc_boolean_t has_full_recv_buffer(void)
{
	if (HEAD_FIFO(full_recv_fifo) != NULL)
		return (ISC_TRUE);
	else {
		/* allocate more buffers if needed as a side effect
		 * in get_full_recv_buffer() (which will return NULL) */
		get_full_recv_buffer();
		return (ISC_FALSE);
	}
}
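
For what it's worth, the reason that one-line change works is that
get_full_recv_buffer() is where the free list gets replenished once
we're safely out of the signal handler.  Roughly, and with placeholder
names (packets_dropped, total_recvbufs) standing in for whatever the
real bookkeeping in your libntp/recvbuff.c is called:

recvbuf_t *
get_full_recv_buffer(void)
{
	recvbuf_t *rbuf;

	LOCK();
	/*
	 * Safe, non-signal context: if a packet was dropped earlier
	 * because the free list was empty, grow the pool now.
	 */
	if (packets_dropped > 0 && total_recvbufs < RECV_TOOMANY) {
		create_buffers(RECV_INC);
		packets_dropped = 0;
	}

	/* hand back the oldest fully-received packet, if any */
	UNLINK_FIFO(rbuf, full_recv_fifo, link);
	if (rbuf != NULL)
		full_recvbufs--;
	UNLOCK();

	return (rbuf);
}

Calling it from has_full_recv_buffer() just guarantees that
replenishment gets a chance to run even if the main loop never again
finds a full buffer waiting.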

If receive buffer shortfall is the event that triggers ntpd hanging,
you might be able to accelerate reproducing the problem by tweaking
these values in include/recvbuff.h:

/*
 * recvbuf memory management
 */
#define RECV_INIT	10	/* 10 buffers initially */
#define RECV_LOWAT	3	/* when we're down to three buffers get more */
#define RECV_INC	5	/* get 5 more at a time */
#define RECV_TOOMANY	40	/* this is way too many buffers */
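
For example (these particular numbers are only a guess at something
aggressive enough to show the problem quickly, not values I've tested):

#define RECV_INIT	4	/* only 4 buffers initially */
#define RECV_LOWAT	1	/* refill when down to one buffer */
#define RECV_INC	2	/* get 2 more at a time */
#define RECV_TOOMANY	8	/* hit the ceiling much sooner */

With the ceiling that low, a burst of packets or a swap storm like the
one your cron job causes should exhaust the pool far sooner.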

Good luck,
Dave Hart

