[ntp:questions] Sudden change in precision and jitter

David Lord snews at lordynet.org
Sat Aug 10 18:52:52 UTC 2013


A C wrote:
> Old thread but new data coming up.  After running for a good while, ntpd 
> finally spun out of control as I've described before.  It swung the 
> clock around and then stopped doing anything.  When I finally 
> restarted it, the clock was over 90 seconds off (the relevant log entry 
> is below):
> Aug 10 16:23:02 sunipx2 ntpd[23542]: 0.0.0.0 c41c 0c clock_step -95.543901 s
> 
> I have all stats files turned on, so below is a link to a combined file 
> containing the configuration, main log, peerstats (both filtered for ATOM 
> and SHM and an unfiltered version), clockstats, loopstats, sysstats, and 
> rawstats for the time period when the system spun out.
> 
> Perhaps one of you can spot something that I'm overlooking in these 
> files.  Everything works great and then it collapses very quickly 
> (within one or two polling cycles at most).
> 
> http://acarver.net/ntpd/combinedlogs20130810.txt
> 
> If you need/want more data just say so.
> 
> 
> 
> 
> On 6/2/2013 13:43, A C wrote:
>> On 6/2/2013 13:20, unruh wrote:
>>> On 2013-06-02, A C <agcarver+ntp at acarver.net> wrote:
>>>> On 6/2/2013 02:24, David Woolley wrote:
>>>>> A C wrote:
>>>>>>
>>>>>> That would be interesting since I have a cron job restarting it at an
>>>>>> odd hour away from any other cron jobs left.  I'll check and see if
>>>>>
>>>>> Why are you restarting it?  ntpd works best if left to run
>>>>> continuously.
>>>>
>>>> I know it does...unless there is a bug (a compound bug between ntpd and
>>>> the kernel) that causes ntpd to spin out of control every few weeks and
>>>> forces me to restart it anyway.  By "spin out of control" I mean that
>>>> CPU usage goes to near 100% and ntpd stops disciplining the clock after
>>>> it has forced the clock to run at some insane rate (e.g. the nominal
>>>> PPM tick adjustment might be -78 and it ramps the tick to +350 PPM over
>>>> a few minutes).  The end result is that the clock is very wrong and ntpd
>>>> has totally stopped doing anything, yet somehow it's caught in an
>>>> infinite loop at maximum CPU usage, so almost nothing else on the
>>>> system works right.
>>>>
>>>> I have a remote system that watches the billboard from this instance of
>>>> ntpd (by running ntpq -p <IP> from another machine).  When the problem
>>>> happens you can see all the offsets are in the tens of thousands, and
>>>> the log file shows a series of moderate (less than one second) clock
>>>> spikes and clock_syncs followed by either a shift large enough that
>>>> ntpd stops bothering to fix the clock (deselects all peers and sits) or
>>>> an absurd calculated clock step of approximately 2^32 - 1 seconds, even
>>>> though the clock itself is actually only out by tens or hundreds of
>>>> seconds at most (the initial clock step correction applied when ntpd
>>>> restarts has never been more than 200 seconds).
>>>>
>>>> And before anyone says anything, the machine/clock is not broken.  It
>>>> keeps very good time (offset from PPS is typically less than 30
>>>> microseconds) right up until some event trips the bug.  At that point
>>>> ntpd starts hunting and stepping the clock back and forth (four to five
>>>> clock spike_detects within a period of less than five minutes), and
>>>> then the crash follows.  After I restart it, everything settles back
>>>> down and stays fine for several weeks, then it all repeats.  The timing
>>>> between repeats is not exact: sometimes it happens in three weeks,
>>>> sometimes in five.  Once in a great while it has happened within days
>>>> of a restart, but that is rare.  Three to five weeks of run time before
>>>> the bug appears is the common failure mode.
>>>
>>> Do you have all logging set up (peerstats, loopstats, refclocks, ....)
>>> so you can post the contents of those files around the time that ntp
>>> goes mad? It sure should not be doing that.
>>
>> Yes, all logging is turned on.  Main, peer, loop, clock, sys, and raw.
>> I'll post to this thread next time it takes off.  I've been trying to
>> track this bug down for a long time with no luck so far.
>>


Hi

What hit me was your "tos minsane 1".

I'm told that both my GPS and MSF sources cannot be blacked out by
weather conditions, but then I also see flying saucers.
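
With minsane 1 a single surviving source is enough for ntpd to set the
clock, so one misbehaving refclock can drag the time with it.  Purely as
a sketch (the unit numbers, minpoll and fudges below are my assumptions,
not taken from your config), requiring two survivors for an ATOM plus
SHM setup would look something like:

# sketch only -- adjust unit numbers and fudges to the real setup
tos     minsane 2                  # need at least two survivors before syncing
server  127.127.22.0  minpoll 4    # ATOM (kernel PPS) refclock, driver 22
fudge   127.127.22.0  refid PPS
server  127.127.28.0  minpoll 4    # SHM refclock (fed by gpsd or similar), driver 28
fudge   127.127.28.0  refid GPS
# keep the existing network servers as well; the ATOM driver also needs
# a prefer source to number the seconds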

ntp2.lordynet.org.uk has been in the pool since late 2009:

# ntp.conf

tos             minsane 3
tos             orphan 10
tos             mindist 0.01

# radioclkd2 -s timepps tty00:-dcd
server  127.127.28.0
fudge   127.127.28.0  stratum 4  time1 0.024000  refid MSFa  # 13062901

peer -4 ntp1.lordynet.org minpoll 6 maxpoll 7 iburst
peer -4 ntp3.lordynet.org minpoll 6 maxpoll 7 iburst

server -4 ntp0.lordynet.org.uk minpoll 6 maxpoll 7 iburst
server -4 xxxxx minpoll 6 maxpoll 7 iburst  prefer

server -4 xxxxx minpoll 8 maxpoll 12 iburst
server -4 xxxxx minpoll 8 maxpoll 12 iburst
server -4 xxxxx minpoll 8 maxpoll 12
server -4 xxxxx minpoll 8 maxpoll 12
server -4 xxxxx minpoll 8 maxpoll 12


# ntpq -c rv -p

associd=0 status=061d leap_none, sync_ntp, 1 event, kern,
version="ntpd 4.2.6p5-o Wed Feb  1 07:49:06 UTC 2012 (import)",
processor="i386", system="NetBSD/6.1_STABLE", leap=00, stratum=2,
precision=-18, rootdelay=0.695, rootdisp=411.242, refid=192.168.59.61,
reftime=d5b0fec3.f491b27b  Sat, Aug 10 2013 18:02:43.955,
clock=d5b0ff83.4cef85c4  Sat, Aug 10 2013 18:05:55.300, peer=38671, tc=7,
mintc=3, offset=0.163, frequency=-49.792, sys_jitter=0.378,
clk_jitter=0.096, clk_wander=0.001, tai=35, leapsec=201207010000,
expire=201312010000

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
-SHM(0)          .MSFa.           4 l   92   64  132    0.000   -3.624   2.700
-ns1.lordynet.or 129.215.42.240   3 u   30  128  376    0.299   -0.597   0.056
-ns3.lordynet.or 129.215.160.240  3 u   85  128  376    0.371   -0.006   0.099
-ns0.lordynet.or 195.173.57.232   3 u   12  128  377    0.383    1.133   0.652
*xxxxxxxxxxxxxxx .PPSb.           1 u   64  128  377    0.695    0.163   0.378
 xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u   98  256  377   21.771   -0.086   0.307
 xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u  190  256  377   23.925    0.833   0.348
 xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u   36  256  377   29.408    0.156   0.170
 xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u   66  256  377   30.961    1.058   0.219
+xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u  111  256  377   48.076    0.179   0.355
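
As an aside, a rough way to see how many sources survive selection is to
count the survivor tally codes (* o + #) in the first column of the
billboard; the awk one-liner below is only a sketch of mine, and you can
add the remote host name to query your Sun box from elsewhere as you
already do:

# count billboard entries carrying a survivor tally code
ntpq -pn | awk 'NR > 2 && /^[*o+#]/ { n++ } END { print n+0, "survivors" }'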

David



