[ntp:questions] Sudden change in precision and jitter
David Lord
snews at lordynet.org
Sat Aug 10 18:52:52 UTC 2013
A C wrote:
> Old thread but new data coming up. After running for a nice while ntpd
> finally spun out of control as I've described before. It swung the
> clock around and then finally stopped doing anything. When I finally
> restarted the clock was over 90 seconds off (the appropriate log entry
> here):
> Aug 10 16:23:02 sunipx2 ntpd[23542]: 0.0.0.0 c41c 0c clock_step -95.543901 s
>
> I have all stats files turned on so below is a link to a combined file
> from the configuration, main log, peers (both filtered for ATOM and SHM
> and an unfiltered version), clockstats, loopstats, sysstats, and
> rawstats for the time period when the system spun out.
>
> Perhaps any of you can spot something that I'm overlooking in these
> files. Everything works great and then it collapses very quickly
> (within one or two polling cycles at most).
>
> http://acarver.net/ntpd/combinedlogs20130810.txt
>
> If you need/want more data just say so.
>
>
>
>
> On 6/2/2013 13:43, A C wrote:
>> On 6/2/2013 13:20, unruh wrote:
>>> On 2013-06-02, A C <agcarver+ntp at acarver.net> wrote:
>>>> On 6/2/2013 02:24, David Woolley wrote:
>>>>> A C wrote:
>>>>>>
>>>>>> That would be interesting since I have a cron job restarting it at an
>>>>>> odd hour away from any other cron jobs left. I'll check and see if
>>>>>
>>>>> Why are you restarting it? ntpd works best if left to run
>>>>> continuously.
>>>>
>>>> I know it does...unless there is a bug (a compound bug between ntpd and
>>>> the kernel) that causes ntpd to spin out of control every few weeks and
>>>> forces me to restart it anyway. By spin out of control I do mean that
>>>> CPU usage goes to near 100% and ntpd stops disciplining the clock after
>>>> it managed to force the clock to run at some insane rate (e.g. nominal
>>>> PPM tick adjustment might be -78 and it ramps the tick to +350 PPM over
>>>> a few minutes). The end result is that the clock is very wrong, ntpd
>>>> has totally stopped doing anything, but somehow it's caught in an
>>>> infinite loop with maximum CPU usage meaning almost nothing else on the
>>>> system is working right.
>>>>
>>>> I have a remote system that watches the billboard from this instance of
>>>> ntpd (by running ntpq -p <IP> from another machine) and when the
>>>> problem
>>>> happens you can see all the offsets are in the tens of thousands and
>>>> the
>>>> log file indicates a series of moderate (less than one second) clock
>>>> spikes and clock_syncs followed by either enough of a shift that ntpd
>>>> stops bothering to fix the clock (deselects all peers and sits) or an
>>>> absurd calculated clock step of approximately 2^32 - 1 seconds even
>>>> though the clock itself is actually only out by tens or hundreds of
>>>> seconds at most (the initial clock step correction applied when ntpd
>>>> restarts has never been more than 200 seconds).
>>>>
>>>> And before anyone says anything, the machine/clock is not broken. It
>>>> keeps very good time (offset from PPS is typically less than 30
>>>> microseconds) right up until some event trips the bug. At that point
>>>> ntpd starts hunting and stepping the clock back and forth (four to five
>>>> clock spike_detects within a period of less than five minutes) and then
>>>> the crash comes. After I restart it, everything settles back down and stays fine
>>>> for several weeks. A few weeks later everything repeats. The timing
>>>> between the repeats is not exact, sometimes it happens in three weeks,
>>>> sometimes in five. Once in a great while it has happened within
>>>> days of
>>>> a restart but that is rare. Three to five weeks of run time before the
>>>> bug appears is the common failure mode.
>>>
>>> Do you have all logging set up (peerstats, loopstats, refclocks, ....)
>>> so you can post the contents of those files around the time that ntp
>>> goes mad? It sure should not be doing that.
>>
>> Yes, all logging is turned on. Main, peer, loop, clock, sys, and raw.
>> I'll post to this thread next time it takes off. I've been trying to
>> track this bug down for a long time with no luck so far.
>>
Hi,

What hit me was your "tos minsane 1".

Both my GPS and MSF sources, I'm told, cannot be blacked out by
weather conditions, but then I also see flying saucers.
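For anyone unfamiliar with the knob (my reading of the ntp.conf tos
documentation): minsane is the minimum number of survivors the selection
and clustering algorithms must leave before ntpd will discipline the
clock. With "tos minsane 1" a single surviving source, even a
falseticker, is enough to steer the clock; as I understand it the
compiled-in default is also 1. A sketch of a stricter setting, assuming
you have enough redundant sources to afford it:

```
# require at least two agreeing survivors before syncing the clock
tos minsane 2
```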
ntp2.lordynet.org.uk has been in the pool since late 2009:
# ntp.conf
tos minsane 3
tos orphan 10
tos mindist 0.01
# radioclkd2 -s timepps tty00:-dcd
server 127.127.28.0
fudge 127.127.28.0 stratum 4 time1 0.024000 refid MSFa # 13062901
peer -4 ntp1.lordynet.org minpoll 6 maxpoll 7 iburst
peer -4 ntp3.lordynet.org minpoll 6 maxpoll 7 iburst
server -4 ntp0.lordynet.org.uk minpoll 6 maxpoll 7 iburst
server -4 xxxxx minpoll 6 maxpoll 7 iburst prefer
server -4 xxxxx minpoll 8 maxpoll 12 iburst
server -4 xxxxx minpoll 8 maxpoll 12 iburst
server -4 xxxxx minpoll 8 maxpoll 12
server -4 xxxxx minpoll 8 maxpoll 12
server -4 xxxxx minpoll 8 maxpoll 12
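A note for anyone copying these lines: minpoll and maxpoll are the
base-2 logarithm of the poll interval in seconds, so minpoll 6 polls
every 64 s and maxpoll 12 backs off to at most 4096 s. A quick shell
check of that arithmetic:

```shell
# poll intervals run from 2^minpoll to 2^maxpoll seconds
echo "minpoll 6  -> $(( 1 << 6 )) s"    # 64 s
echo "maxpoll 12 -> $(( 1 << 12 )) s"   # 4096 s
```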
# ntpq -c rv -p
associd=0 status=061d leap_none, sync_ntp, 1 event, kern,
version="ntpd 4.2.6p5-o Wed Feb 1 07:49:06 UTC 2012 (import)",
processor="i386", system="NetBSD/6.1_STABLE", leap=00, stratum=2,
precision=-18, rootdelay=0.695, rootdisp=411.242, refid=192.168.59.61,
reftime=d5b0fec3.f491b27b Sat, Aug 10 2013 18:02:43.955,
clock=d5b0ff83.4cef85c4 Sat, Aug 10 2013 18:05:55.300, peer=38671, tc=7,
mintc=3, offset=0.163, frequency=-49.792, sys_jitter=0.378,
clk_jitter=0.096, clk_wander=0.001, tai=35, leapsec=201207010000,
expire=201312010000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
-SHM(0)          .MSFa.           4 l   92   64  132    0.000   -3.624   2.700
-ns1.lordynet.or 129.215.42.240   3 u   30  128  376    0.299   -0.597   0.056
-ns3.lordynet.or 129.215.160.240  3 u   85  128  376    0.371   -0.006   0.099
-ns0.lordynet.or 195.173.57.232   3 u   12  128  377    0.383    1.133   0.652
*xxxxxxxxxxxxxxx .PPSb.           1 u   64  128  377    0.695    0.163   0.378
 xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u   98  256  377   21.771   -0.086   0.307
 xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u  190  256  377   23.925    0.833   0.348
 xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u   36  256  377   29.408    0.156   0.170
 xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u   66  256  377   30.961    1.058   0.219
+xxxxxxxxxxxxxxx xxxxxxxxxxxxxx   2 u  111  256  377   48.076    0.179   0.355
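For anyone reading that billboard: reach is an eight-bit shift register
printed in octal, one bit per recent poll, so 377 means all of the last
eight polls were answered. You can decode it with printf, which reads a
leading-0 operand as octal:

```shell
# reach is printed in octal; a leading 0 makes printf parse it as octal
printf '%d\n' 0377   # 255 = 11111111b: all of the last eight polls answered
printf '%d\n' 0376   # 254 = 11111110b: one of the last eight missed
printf '%d\n' 0132   #  90 = 01011010b: several recent misses
```

The SHM refclock's reach of 132 shows several missed polls in the last
eight, which fits its noticeably higher jitter in the billboard above.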
David