[ntp:questions] Sudden change in precision and jitter

A C agcarver+ntp at acarver.net
Sun Jun 2 20:43:44 UTC 2013


On 6/2/2013 13:20, unruh wrote:
> On 2013-06-02, A C <agcarver+ntp at acarver.net> wrote:
>> On 6/2/2013 02:24, David Woolley wrote:
>>> A C wrote:
>>>>
>>>> That would be interesting since I have a cron job restarting it at an
>>>> odd hour away from any other cron jobs left.  I'll check and see if
>>>
>>> Why are you restarting it?  ntpd works best if left to run continuously.
>>
>> I know it does...unless there is a bug (a compound bug between ntpd and
>> the kernel) that causes ntpd to spin out of control every few weeks and
>> forces me to restart it anyway.  By spin out of control I do mean that
>> CPU usage goes to near 100% and ntpd stops disciplining the clock after
>> it managed to force the clock to run at some insane rate (e.g. nominal
>> PPM tick adjustment might be -78 and it ramps the tick to +350 PPM over
>> a few minutes).  The end result is that the clock is very wrong, ntpd
>> has totally stopped doing anything, but somehow it's caught in an
>> infinite loop with maximum CPU usage meaning almost nothing else on the
>> system is working right.
>>
>> I have a remote system that watches the billboard from this instance of
>> ntpd (by running ntpq -p <IP> from another machine) and when the problem
>> happens you can see all the offsets are in the tens of thousands and the
>> log file indicates a series of moderate (less than one second) clock
>> spikes and clock_syncs followed by either enough of a shift that ntpd
>> stops bothering to fix the clock (deselects all peers and sits) or an
>> absurd calculated clock step of approximately 2^32 - 1 seconds even
>> though the clock itself is actually only out by tens or hundreds of
>> seconds at most (the initial clock step correction applied when ntpd
>> restarts has never been more than 200 seconds).
>>
>> And before anyone says anything, the machine/clock is not broken.  It
>> keeps very good time (offset from PPS is typically less than 30
>> microseconds) right up until some event trips the bug.  At that point
>> ntpd starts hunting and stepping the clock back and forth (four to five
>> clock spike_detects within a period of less than five minutes) and then
>> crashes.  After I restart it, everything settles back down and stays fine
>> for several weeks.  A few weeks later everything repeats.  The timing
>> between the repeats is not exact, sometimes it happens in three weeks,
>> sometimes in five.  Once in a great while it has happened within days of
>> a restart but that is rare.  Three to five weeks of run time before the
>> bug appears is the common failure mode.
>
> Do you have all logging set up (peerstats, loopstats, refclocks, ....)
> so you can post the contents of those files around the time that ntp
> goes mad? It sure should not be doing that.

Yes, all logging is turned on: main, peer, loop, clock, sys, and raw.
I'll post to this thread the next time it takes off.  I've been trying to
track this bug down for a long time with no luck so far.
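
For anyone who wants to watch for the same symptom, the remote check
described above (running ntpq -p <IP> from another machine and looking at
the offsets) could be sketched roughly as follows.  This is only an
illustration, not the script actually in use: the function names and the
1000 ms limit are assumptions, and it assumes the usual ten-column
billboard layout with the offset column in milliseconds.

```python
import subprocess


def fetch_billboard(host, timeout=10):
    """Run `ntpq -p <host>` and return its text output."""
    result = subprocess.run(
        ["ntpq", "-p", host],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


def parse_billboard(text):
    """Return a list of (tally, remote, offset_ms) from billboard text."""
    peers = []
    for line in text.splitlines():
        if len(line) < 2:
            continue
        # The first character is the tally code (' ', '*', '+', 'o', ...).
        tally, fields = line[0], line[1:].split()
        # Assumed layout: remote refid st t when poll reach delay offset jitter
        if len(fields) != 10:
            continue  # skips the ===== separator and anything unexpected
        try:
            offset_ms = float(fields[8])
        except ValueError:
            continue  # skips the column-header line
        peers.append((tally, fields[0], offset_ms))
    return peers


def all_offsets_exceed(peers, limit_ms=1000.0):
    """True if there is at least one peer and every offset is past the limit.

    limit_ms is a hypothetical threshold chosen for illustration.
    """
    return bool(peers) and all(abs(off) > limit_ms for _, _, off in peers)
```

A cron job on the watching machine could call
all_offsets_exceed(parse_billboard(fetch_billboard("<IP>"))) and raise an
alert when it returns True, since a healthy instance should show at least
one peer with a small offset.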


