[ntp:questions] Sudden change in precision and jitter

A C agcarver+ntp at acarver.net
Sun Jun 2 17:44:23 UTC 2013


On 6/2/2013 02:24, David Woolley wrote:
> A C wrote:
>>
>> That would be interesting since I have a cron job restarting it at an
>> odd hour away from any other cron jobs left.  I'll check and see if
>
> Why are you restarting it?  ntpd works best if left to run continuously.

I know it does...unless there is a bug (a compound bug between ntpd and
the kernel) that causes ntpd to spin out of control every few weeks and
forces me to restart it anyway.  By spin out of control I mean that
CPU usage goes to near 100% and ntpd stops disciplining the clock after
it has forced the clock to run at some insane rate (e.g. nominal
PPM tick adjustment might be -78 and it ramps the tick to +350 PPM over
a few minutes).  The end result is that the clock is very wrong and ntpd
has totally stopped doing anything, yet it is somehow caught in an
infinite loop with maximum CPU usage, meaning almost nothing else on the
system is working right.
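
To put the -78 and +350 PPM figures in perspective, here is a rough
sketch of the arithmetic.  It assumes the PPM value can be read as a
plain rate error (parts per million of elapsed time); the function name
is just for illustration and nothing here comes from ntpd itself.

```python
SECONDS_PER_DAY = 86_400


def drift_seconds_per_day(ppm):
    """Seconds gained (+) or lost (-) per day at a frequency offset given in PPM."""
    return ppm * 1e-6 * SECONDS_PER_DAY


nominal = drift_seconds_per_day(-78)   # the usual tick adjustment quoted above
runaway = drift_seconds_per_day(350)   # the value it ramps to when the bug hits

print(f"{nominal:+.2f} s/day at -78 PPM")
print(f"{runaway:+.2f} s/day at +350 PPM")
print(f"{runaway - nominal:+.2f} s/day difference between the two")
```

So a swing of that size is worth tens of seconds per day, which is in
the same range as the errors I see before a restart.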

I have a remote system that watches the billboard from this instance of
ntpd (by running ntpq -p <IP> from another machine).  When the problem
happens you can see all the offsets are in the tens of thousands, and the
log file indicates a series of moderate (less than one second) clock
spikes and clock_syncs.  These are followed by either enough of a shift
that ntpd stops bothering to fix the clock (deselects all peers and sits)
or an absurd calculated clock step of approximately 2^32 - 1 seconds,
even though the clock itself is actually only out by tens or hundreds of
seconds at most (the initial clock step correction applied when ntpd
restarts has never been more than 200 seconds).
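
The remote check can be sketched roughly as below.  This is not the
script I actually run, only a minimal illustration: the 10000 threshold,
the function names, and the SAMPLE billboard (documentation addresses,
made-up numbers) are all assumptions.  The only real command involved is
ntpq -p <host>, whose last three columns are delay, offset and jitter.

```python
import subprocess

THRESHOLD_MS = 10_000.0  # assumed cut-off for "tens of thousands"

# Invented billboard for illustration only; not captured from a real server.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.1       .PPS.            1 u   12   64  377    0.512    0.021   0.122
 192.0.2.2       .GPS.            1 u   40   64  377    1.204 -35210.4  88.310
"""


def parse_offsets(billboard):
    """Map each peer name to its offset column from `ntpq -p` style text."""
    offsets = {}
    for line in billboard.splitlines()[2:]:   # skip the two header lines
        fields = line[1:].split()             # column 0 is the tally character
        if len(fields) < 10:
            continue
        try:
            offsets[fields[0]] = float(fields[-2])
        except ValueError:
            continue
    return offsets


def runaway_peers(billboard, threshold_ms=THRESHOLD_MS):
    """Return the peers whose absolute offset is at or above the threshold."""
    found = parse_offsets(billboard)
    return sorted(name for name, off in found.items() if abs(off) >= threshold_ms)


def fetch_billboard(host):
    """Run `ntpq -p <host>` and return its text output."""
    result = subprocess.run(["ntpq", "-p", host], capture_output=True,
                            text=True, timeout=30, check=True)
    return result.stdout


print(runaway_peers(SAMPLE))
```

On the watching machine one would call runaway_peers(fetch_billboard(ip))
from cron and raise an alert when the list is not empty.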

And before anyone says anything, the machine/clock is not broken.  It 
keeps very good time (offset from PPS is typically less than 30 
microseconds) right up until some event trips the bug.  At that point
ntpd starts hunting and stepping the clock back and forth (four to five
clock spike_detects within a period of less than five minutes) and then
comes the crash.  After I restart it, everything settles back down and stays fine
for several weeks.  A few weeks later everything repeats.  The timing 
between the repeats is not exact, sometimes it happens in three weeks, 
sometimes in five.  Once in a great while it has happened within days of 
a restart but that is rare.  Three to five weeks of run time before the 
bug appears is the common failure mode.

>
>> there are any others that I missed and then move the restart somewhere
>> if there's an overlap.
>>
>> I wish I could figure out how 4.2.6 and 4.2.7 differ in the
>> calculations because under 4.2.6 my jitter is always 0.061 while under
>> 4.2.7 it is always 0.122 no matter what I do.
>
> There are two factors that affect precision:  the clock tick rate and
> the time needed to read the clock. The precision will tend to be based
> on the latter, rounded up to a multiple of the former.  This is then
> rounded to a power of two fraction of a second.
>
> If the time to read is very close to an exact multiple of the hardware
> resolution, a small perturbation could change the number of ticks, which
> might then change the result into the next power of two range.


The issue you quote from my message is a matter of calculation, not
hardware.  4.2.6 consistently calculates a jitter of 0.061 while 4.2.7
consistently calculates a jitter of 0.122 regardless of what the system
is (or is not) doing at the time.  That's entirely a computation change;
it would not be repeatable if it were the hardware.  But Harlan pointed
out that the method of calculation did change so that would explain the 
difference I see between 4.2.6 and 4.2.7.

However, a small perturbation is exactly what happened on the most
recent restart, which sent the jitter from 0.122 to 0.244.  I restarted
ntpd manually yesterday when nothing else was running and it recomputed 
a jitter of 0.122 again.  I did actually discover one overlapping cron 
job that would have kept the system busy during the restart of ntpd so I 
moved ntpd's restart to an odd time.  I'll see what happens in a few 
weeks when the cron job restarts ntpd again.
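
As a side note, the three jitter values in this thread line up with
consecutive power-of-two fractions of a second, which fits the
power-of-two rounding David describes.  The sketch below only checks
that numerical fit.  It assumes the values are in milliseconds (as ntpq
reports them) and says nothing about how either ntpd version actually
computes them; the function name is made up for the example.

```python
import math


def nearest_power_of_two_exponent(value_ms):
    """Exponent e such that 2**e seconds is closest (in log terms) to value_ms."""
    seconds = value_ms / 1000.0
    return round(math.log2(seconds))


for reported_ms in (0.061, 0.122, 0.244):
    exp = nearest_power_of_two_exponent(reported_ms)
    print(f"{reported_ms:.3f} ms ~ 2^{exp} s = {2.0 ** exp * 1000:.3f} ms")
```

The exponents come out as -14, -13 and -12, so each change I have seen
is exactly one power-of-two step.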

