[ntp:questions] Sudden change in precision and jitter
agcarver+ntp at acarver.net
Sat Aug 10 17:07:43 UTC 2013
Old thread, but new data has come up. After running well for a while, ntpd
finally spun out of control as I've described before. It swung the
clock around and then stopped doing anything. When I finally
restarted ntpd, the clock was over 90 seconds off. The relevant log entry:
Aug 10 16:23:02 sunipx2 ntpd: 0.0.0.0 c41c 0c clock_step -95.543901 s
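For anyone who wants to pull entries like this out of the main log, here is a minimal sketch. It assumes every step entry has the same layout as the single line above; the function name find_clock_steps and the one-second threshold are illustrative choices, not anything ntpd defines.

```python
import re

# Pattern inferred from the one example entry above (an assumption,
# not taken from ntpd documentation).
STEP_RE = re.compile(r"clock_step\s+([+-]?\d+(?:\.\d+)?)\s+s\b")

def find_clock_steps(lines, min_abs_seconds=1.0):
    """Return (line, step_seconds) for each clock_step at or above the threshold."""
    hits = []
    for line in lines:
        m = STEP_RE.search(line)
        if m is None:
            continue  # not a clock_step entry
        step = float(m.group(1))
        if abs(step) >= min_abs_seconds:
            hits.append((line.rstrip("\n"), step))
    return hits

# The only real data here is the log line quoted in this message.
sample = ["Aug 10 16:23:02 sunipx2 ntpd: 0.0.0.0 c41c 0c clock_step -95.543901 s"]
for line, step in find_clock_steps(sample):
    print(f"{step:+.6f} s  <- {line}")
```

To run it against a real log, pass an open file object instead of the sample list.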
I have all stats files turned on, so below is a link to a combined file
containing the configuration, main log, peerstats (one version filtered for
ATOM and SHM, one unfiltered), clockstats, loopstats, sysstats, and
rawstats for the time period when the system spun out.
Perhaps one of you can spot something that I'm overlooking in these
files. Everything works great and then it collapses very quickly
(within one or two polling cycles at most).
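Since the collapse happens within a polling cycle or two, one way to locate it in loopstats is to look for a large change in the frequency column between consecutive records. The sketch below assumes the usual loopstats layout (MJD, seconds past midnight, offset in seconds, frequency in PPM, then jitter, wander, and poll exponent). The names parse_loopstats and find_freq_jumps and the 50 PPM threshold are made up for illustration, and the sample records are synthetic, not taken from the posted files.

```python
def parse_loopstats(lines):
    """Yield (mjd, seconds, offset_s, freq_ppm) from loopstats-style lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            yield (int(fields[0]), float(fields[1]),
                   float(fields[2]), float(fields[3]))
        except ValueError:
            continue  # skip lines that are not numeric records

def find_freq_jumps(lines, max_delta_ppm=50.0):
    """Return (previous, current) record pairs whose frequency differs
    by more than max_delta_ppm."""
    jumps = []
    prev = None
    for rec in parse_loopstats(lines):
        if prev is not None and abs(rec[3] - prev[3]) > max_delta_ppm:
            jumps.append((prev, rec))
        prev = rec
    return jumps

# Synthetic records for illustration only (not from the posted files).
synthetic = [
    "56514 58000.000 0.000021 -78.000 0.000010 0.001 4",
    "56514 58016.000 0.000025 -77.900 0.000012 0.001 4",
    "56514 58032.000 0.150000 120.000 0.090000 40.0 4",
    "56514 58048.000 0.400000 350.000 0.200000 80.0 4",
]
for before, after in find_freq_jumps(synthetic):
    print(f"MJD {after[0]} t={after[1]:.0f}s: "
          f"{before[3]:+.3f} -> {after[3]:+.3f} PPM")
```

The threshold is arbitrary; on a machine that normally sits near a steady drift value, a much smaller limit would also work.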
If you need/want more data just say so.
On 6/2/2013 13:43, A C wrote:
> On 6/2/2013 13:20, unruh wrote:
>> On 2013-06-02, A C <agcarver+ntp at acarver.net> wrote:
>>> On 6/2/2013 02:24, David Woolley wrote:
>>>> A C wrote:
>>>>> That would be interesting since I have a cron job restarting it at an
>>>>> odd hour away from any other cron jobs left. I'll check and see if
>>>> Why are you restarting it? ntpd works best if left to run
>>> I know it does...unless there is a bug (a compound bug between ntpd and
>>> the kernel) that causes ntpd to spin out of control every few weeks and
>>> forces me to restart it anyway. By spin out of control I do mean that
>>> CPU usage goes to near 100% and ntpd stops disciplining the clock after
>>> it managed to force the clock to run at some insane rate (e.g. nominal
>>> PPM tick adjustment might be -78 and it ramps the tick to +350 PPM over
>>> a few minutes). The end result is that the clock is very wrong, ntpd
>>> has totally stopped doing anything, but somehow it's caught in an
>>> infinite loop with maximum CPU usage meaning almost nothing else on the
>>> system is working right.
>>> I have a remote system that watches the billboard from this instance of
>>> ntpd (by running ntpq -p <IP> from another machine) and when the problem
>>> happens you can see all the offsets are in the tens of thousands and the
>>> log file indicates a series of moderate (less than one second) clock
>>> spikes and clock_syncs followed by either enough of a shift that ntpd
>>> stops bothering to fix the clock (deselects all peers and sits) or an
>>> absurd calculated clock step of approximately 2^32 - 1 seconds even
>>> though the clock itself is actually only out by tens or hundreds of
>>> seconds at most (the initial clock step correction applied when ntpd
>>> restarts has never been more than 200 seconds).
>>> And before anyone says anything, the machine/clock is not broken. It
>>> keeps very good time (offset from PPS is typically less than 30
>>> microseconds) right up until some event trips the bug. At that point
>>> ntpd starts hunting and stepping the clock back and forth (four to five
>>> clock spike_detects within a period of less than five minutes) and then
>>> crashes. After I restart it, everything settles back down and stays fine
>>> for several weeks. A few weeks later everything repeats. The timing
>>> between the repeats is not exact, sometimes it happens in three weeks,
>>> sometimes in five. Once in a great while it has happened within days of
>>> a restart but that is rare. Three to five weeks of run time before the
>>> bug appears is the common failure mode.
>> Do you have all logging set up (peerstats, loopstats, refclocks, ....)
>> so you can post the contents of those files around the time that ntp
>> goes mad? It sure should not be doing that.
> Yes, all logging is turned on. Main, peer, loop, clock, sys, and raw.
> I'll post to this thread next time it takes off. I've been trying to
> track this bug down for a long time with no luck so far.