[ntp:questions] Sudden change in precision and jitter

A C agcarver+ntp at acarver.net
Tue Jun 11 16:46:37 UTC 2013


On 6/2/2013 15:26, A C wrote:
> On 6/2/2013 13:33, David Woolley wrote:
>> Harlan Stenn wrote:
>>> A C writes:
>>>> On 6/2/2013 02:24, David Woolley wrote:
>>>>>> That would be interesting since I have a cron job restarting it at an
>>>>>> odd hour away from any other cron jobs left.  I'll check and see if
>>>>> Why are you restarting it?  ntpd works  best if left to run
>>>>> continuously.
>>>> I know it does...unless there is a bug (a compound bug between ntpd
>>>> and the kernel) that causes ntpd to spin out of control every few
>>>> weeks and forces me to restart it anyway.  By spin out of control I
>>>> do mean that CPU usage goes to near 100% and ntpd stops disciplining
>>>> the clock after it managed to force the clock to run at some insane
>>>> rate (e.g. nominal PPM tick adjustment might be -78 and it ramps the
>>>> tick to +350 PPM over a few minutes).  The end result is that the
>>>> clock is very wrong, ntpd has totally stopped doing anything, but
>>>> somehow it's caught in an infinite loop with maximum CPU usage
>>>> meaning almost nothing else on the system is working right.
>>>
>>> It would be really great to run truss or something in this case, and
>>> I would also hope that the GSoC project to upgrade NTP's loggingd and
>>> debugging interface will help with this too.
>>>
>>> If you want to open a bug report on this that's fine too - we won't be
>>> able to do much about it until we can repeat it or debug it, but it
>>> might be useful information for the aforementioned GSoC project.
>>
>> I think this could be an already known NetBSD bug, see
>> http://groups.google.com/group/comp.protocols.time.ntp/browse_thread/thread/dd92eaf7bdb1c1f3/4449c41fcd6a78bd
>>
>
> Possibly but there's some situation unique to ntpd that triggers the
> issues.  I kept going with that thread on the NetBSD list because the
> problem still showed up although with a longer period before the crash.
>   Various test programs were written and compiled to stress all of the
> floating point operators that ntpd might use but nothing ever happened,
> the system ran the test applications without fail.  That's why I said at
> the beginning of this particular thread that it's a combination bug,
> NetBSD on sparc plus something very peculiar to ntpd's operations trips
> it up.  Any other program never has an issue.
>

It didn't take too long before everything destabilized and ntpd stopped. 
  In this case it just spun out and dropped sync but still remained 
responsive.

Belowis the main log snippet just before it went haywire.  The rest of 
the logs will get posted online for download.  Everything was working 
pretty well (just sys_fuzz messages showing up in the logs prior to 
this) and then there's a sudden set of spike detects as the clock 
wobbles and falls over (no_sys_peer).  It's still performing 
calculations and responding to ntpq queries but it's no longer 
controlling the clock and, according to the offsets with the external 
servers, the clock was now over 30 seconds off.

The times below are UTC.  In the remaining logs, the incident occurs 
somewhere near 56454d 32900s.  I'm leaving the full log intact since 
0000 UTC so you can see any trends.  Everything is quite stable right up 
until very near 56454 32906.  According to the clockstats file, the 
offset jumps from -0.000022767 to 0.004875262 in one second.  Ten 
seconds later the offset is 0.034419235.  It ramps up until it hits 
0.497311954 and suddenly changes sign to -0.497750784.  Then it ramps 
down to zero, ramps up to 0.496989164 and the sign flips again.

The clockstats doesn't show anything else out of the ordinary.  It has 
the line from the SHM driver showing 16 samples received and some of 
them being bad but the majority good.  But this happens for quite a 
while and can be seen throughout the clockstats file.

In the loops file, the PPM value was sitting stable near -75.68 when it 
suddenly starts to change (dropping to -74.3, then -71.6 then -67.85) 
and then a rapid jump to -139.457 where it steadily dropped to -376 
until I killed the process.

The loops file doesn't seem to show anything abnormal until the event 
occurs but I'm not certain.  Same for the rawstats and sysstats files. 
I also don't really see anything in peers.

The diagnostic log files are at http://acarver.net/ntpd/crashdata/


Jun 11 06:16:54 sunipx2 ntpd[7545]: ts_min 1370931414.425737206 ts_prev 
1370931414.425632206 ts 1370931414.462732423
Jun 11 06:16:54 sunipx2 ntpd[7545]: sys_fuzz 105000 nsec, this fuzz 
0.000041586, prior 0.000088949
Jun 11 06:16:54 sunipx2 ntpd[7545]: get_systime prev result 
0xd5613f56.7a3a2e65 is 0.014676266 later than 0xd5613f56.76785b81
Jun 11 09:09:02 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.167383 s
Jun 11 09:11:26 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_sync
Jun 11 09:12:22 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.157203 s
Jun 11 09:14:54 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_sync
Jun 11 09:15:42 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.147021 s
Jun 11 09:18:14 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_synckk
Jun 11 09:19:02 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.136838 s
Jun 11 09:21:34 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_sync
Jun 11 09:22:30 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.166043 s
Jun 11 09:24:54 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_sync
Jun 11 09:25:50 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.155848 s
Jun 11 09:26:59 sunipx2 ntpd[7545]: 0.0.0.0 0118 08 no_sys_peer
Jun 11 09:50:12 sunipx2 ntpd[7545]: ts_min 1370944211.340914811 ts_prev 
1370944211.340809811 ts 1370944212.335866579
Jun 11 09:50:12 sunipx2 ntpd[7545]: sys_fuzz 105000 nsec, this fuzz 
0.000002690, prior 0.000025571
Jun 11 09:50:12 sunipx2 ntpd[7545]: get_systime prev result 
0xd5617154.564041fa is 0.001012620 later than 0xd5617154.55fde50a
Jun 11 12:44:12 sunipx2 ntpd[7545]: ts_min 1370954651.338258199 ts_prev 
1370954651.338153199 ts 1370954652.344166908
Jun 11 12:44:12 sunipx2 ntpd[7545]: sys_fuzz 105000 nsec, this fuzz 
0.000079607, prior 0.000012962
Jun 11 12:44:12 sunipx2 ntpd[7545]: get_systime prev result 
0xd5619a1c.58628820 is 0.001006960 later than 0xd5619a1c.58208a22
Jun 11 14:26:14 sunipx2 ntpd[7545]: ntpd exiting on signal 15 (Terminated)





More information about the questions mailing list