[ntp:questions] Sudden change in precision and jitter
unruh at invalid.ca
Sun Jun 2 20:20:48 UTC 2013
On 2013-06-02, A C <agcarver+ntp at acarver.net> wrote:
> On 6/2/2013 02:24, David Woolley wrote:
>> A C wrote:
>>> That would be interesting since I have a cron job restarting it at an
>>> odd hour away from any other cron jobs left. I'll check and see if
>> Why are you restarting it? ntpd works best if left to run continuously.
> I know it does...unless there is a bug (a compound bug between ntpd and
> the kernel) that causes ntpd to spin out of control every few weeks and
> forces me to restart it anyway. By spin out of control I do mean that
> CPU usage goes to near 100% and ntpd stops disciplining the clock after
> it managed to force the clock to run at some insane rate (e.g. nominal
> PPM tick adjustment might be -78 and it ramps the tick to +350 PPM over
> a few minutes). The end result is that the clock is very wrong, ntpd
> has totally stopped doing anything, but somehow it's caught in an
> infinite loop with maximum CPU usage meaning almost nothing else on the
> system is working right.
> I have a remote system that watches the billboard from this instance of
> ntpd (by running ntpq -p <IP> from another machine) and when the problem
> happens you can see all the offsets are in the tens of thousands and the
> log file indicates a series of moderate (less than one second) clock
> spikes and clock_syncs followed by either enough of a shift that ntpd
> stops bothering to fix the clock (deselects all peers and sits) or an
> absurd calculated clock step of approximately 2^32 - 1 seconds even
> though the clock itself is actually only out by tens or hundreds of
> seconds at most (the initial clock step correction applied when ntpd
> restarts has never been more than 200 seconds).
> And before anyone says anything, the machine/clock is not broken. It
> keeps very good time (offset from PPS is typically less than 30
> microseconds) right up until some event trips the bug. At that point
> ntpd starts hunting and stepping the clock back and forth (four to five
> clock spike_detects within a period of less than five minutes) and then
> crashes. After I restart it, everything settles back down and stays fine
> for several weeks. A few weeks later everything repeats. The timing
> between the repeats is not exact, sometimes it happens in three weeks,
> sometimes in five. Once in a great while it has happened within days of
> a restart but that is rare. Three to five weeks of run time before the
> bug appears is the common failure mode.
Do you have all logging set up (peerstats, loopstats, refclocks, ...)
so you can post the contents of those files around the time that ntpd
goes mad? It sure should not be doing that.
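For anyone gathering that data, here is a minimal sketch (Python, not part of ntpd) of scanning loopstats records for abrupt frequency changes such as the -78 to +350 PPM ramp described above. It assumes the usual loopstats column order (MJD, seconds past midnight, offset in seconds, frequency in PPM, then further fields that are ignored here). The function names, the 50 PPM threshold and the sample records are made up for illustration and are not real log output.

```python
def parse_loopstats_line(line):
    """Split one loopstats record into the fields used below."""
    fields = line.split()
    return {
        "mjd": int(fields[0]),          # Modified Julian Day
        "second": float(fields[1]),     # seconds past midnight UTC
        "offset_s": float(fields[2]),   # clock offset, seconds
        "freq_ppm": float(fields[3]),   # frequency correction, PPM
    }


def find_frequency_jumps(lines, max_step_ppm=50.0):
    """Return (previous, current) record pairs whose frequency differs
    by more than max_step_ppm. The threshold is an arbitrary choice."""
    records = [parse_loopstats_line(l) for l in lines if l.strip()]
    jumps = []
    for prev, cur in zip(records, records[1:]):
        if abs(cur["freq_ppm"] - prev["freq_ppm"]) > max_step_ppm:
            jumps.append((prev, cur))
    return jumps


# Invented sample records, shaped like loopstats lines.
sample = [
    "56445 73000.000 0.000012000 -78.000 0.000030000 0.010000 6",
    "56445 73064.000 0.150000000 120.000 0.050000000 40.000000 6",
    "56445 73128.000 0.420000000 350.000 0.100000000 80.000000 6",
]

for prev, cur in find_frequency_jumps(sample):
    print(prev["second"], prev["freq_ppm"], "->", cur["second"], cur["freq_ppm"])
```

Run against the real loopstats file for the day of a failure, this would narrow down which minutes of peerstats and the main log are worth posting.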
>>> there are any others that I missed and then move the restart somewhere
>>> if there's an overlap.
>>> I wish I could figure out how 4.2.6 and 4.2.7 differ in the
>>> calculations because under 4.2.6 my jitter is always 0.061 while under
>>> 4.2.7 it is always 0.122 no matter what I do.
>> There are two factors that affect precision: the clock tick rate and
>> the time needed to read the clock. The precision will tend to be based
>> on the latter, rounded up to a multiple of the former. This is then
>> rounded to a power of two fraction of a second.
>> If the time to read is very close to an exact multiple of the hardware
>> resolution, a small perturbation could change the number of ticks, which
>> might then change the result into the next power of two range.
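The rounding described above can be sketched in a few lines of Python. This is an illustration of the idea only, not ntpd's actual code: the function names are hypothetical, the 1 microsecond tick and the read times are made-up inputs, and rounding up to the next power of two is an assumption.

```python
import math


def precision_exponent(read_time_s, tick_s):
    """Round the clock read time up to a whole number of ticks, then up
    to the next power of two, returning the exponent (e.g. -14)."""
    ticks = max(1, math.ceil(read_time_s / tick_s))
    rounded_s = ticks * tick_s
    return math.ceil(math.log2(rounded_s))


def as_milliseconds(exponent):
    """Express 2**exponent seconds in milliseconds, to three decimals."""
    return round(2.0 ** exponent * 1000.0, 3)


# Hypothetical inputs: 1 microsecond tick, read times one microsecond apart.
TICK_S = 1e-6
for read_time_s in (60.5e-6, 61.5e-6):
    exp = precision_exponent(read_time_s, TICK_S)
    print(read_time_s, exp, as_milliseconds(exp))
```

With these inputs a read time of 60.5 microseconds falls in the 2^-14 s range and 61.5 microseconds falls in the 2^-13 s range. In milliseconds, 2^-14, 2^-13 and 2^-12 seconds are 0.061, 0.122 and 0.244, the same figures as the jitter values quoted in this thread.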
> The issue you quote from my message is a matter of calculation not
> hardware. 4.2.6 consistently calculates a jitter of 0.061 while 4.2.7
> consistently calculates a jitter of 0.122 regardless of what the system
> is (or is not) doing at the time. That's entirely a computation change,
> it would not be repeatable if it was the hardware. But Harlan pointed
> out that the method of calculation did change so that would explain the
> difference I see between 4.2.6 and 4.2.7.
> However, a small perturbation is exactly what happened to the most
> recent restart that sent the jitter from 0.122 to 0.244. I restarted
> ntpd manually yesterday when nothing else was running and it recomputed
> a jitter of 0.122 again. I did actually discover one overlapping cron
> job that would have kept the system busy during the restart of ntpd so I
> moved ntpd's restart to an odd time. I'll see what happens in a few
> weeks when the cron job restarts ntpd again.