[ntp:questions] Panic stop captured
agcarver+ntp at acarver.net
Sat Dec 15 02:07:09 UTC 2012
On 12/14/2012 17:49, Brian Utterback wrote:
> On 12/14/12 16:11, A C wrote:
>> I finally had a panic stop with an instrumented copy of 4.2.7p270
>> which was capturing various values inside the functions
>> refclock_process_f and refclock_process_offset.
>> I'm going to post an entire capture of the log but one thing I noticed
>> is that the value of pp->nsec at the top of refclock_process_f slowly
>> ticks down and then wraps around. Eventually a panic stop happens:
>> panic_stop +2147483648 s; set clock manually within 1000 s.
>> (but the clock hasn't changed, the system time is still correct to
>> within a second of all my other systems).
>> Apparently the wrap around happens pretty frequently so it's not
>> likely the cause of the problem.
>> For now the log is available at http://acarver.net/ntpd/ntpd_panic.log
>> (8 MB file, be warned) in case any of you has an idea about what might
>> be happening to cause the panic or there's anything else I should
>> instrument and write to the logs instead. There are several steps in
>> each of the functions written out to the log looking for overflows or
>> similar. I do have various statistics files for that period, too, if
>> they would be useful.
> Not knowing how the executable was instrumented, it is hard to tell
> exactly what all of this output means. But there is one thing that jumps
> out at me. Exactly at the time that the panic stop occurs, ntpd has just
> peered with the SHM refclock. This suggests that the time stored in
> shared memory is either uninitialized or in the wrong format.
The two functions are named in the output, and the location inside each
function is also listed (e.g. "after L_SUB"). I could probably also
post the modified functions so the instrumentation is documented.
Yes, SHM is currently configured as the preferred peer and PPS is
usually the system peer. SHM is coming from gpsd and PPS is coming
directly from a serial port (separate from the one used by gpsd).
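To test the theory that the shared memory contents are uninitialized or
malformed, a plausibility check on each sample could look roughly like
the sketch below. The struct is a simplified stand-in, not the actual
SHM layout shared by gpsd and ntpd, and every name and limit in it is an
assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000
/* Assumed limit, chosen to match the 1000 s quoted in the panic message. */
#define MAX_PLAUSIBLE_DIFF_SEC 1000

/* Hypothetical, simplified sample record; NOT the real ntpd SHM struct. */
struct shm_sample {
    int     valid;       /* nonzero once the writer has filled the record */
    int64_t clock_sec;   /* reference clock time, whole seconds */
    int32_t clock_nsec;  /* reference clock time, nanoseconds */
    int64_t recv_sec;    /* local receive time, whole seconds */
    int32_t recv_nsec;   /* local receive time, nanoseconds */
};

static bool nsec_in_range(int32_t nsec)
{
    return nsec >= 0 && nsec < NSEC_PER_SEC;
}

/* Returns true only if the sample is marked valid and its fields are sane. */
bool shm_sample_plausible(const struct shm_sample *s)
{
    if (s == NULL || !s->valid)
        return false;
    if (!nsec_in_range(s->clock_nsec) || !nsec_in_range(s->recv_nsec))
        return false;
    /* An all-zero or negative timestamp suggests an unwritten record. */
    if (s->clock_sec <= 0 || s->recv_sec <= 0)
        return false;
    /* Both values are positive here, so the subtraction cannot overflow. */
    int64_t diff = s->clock_sec - s->recv_sec;
    return diff <= MAX_PLAUSIBLE_DIFF_SEC && diff >= -MAX_PLAUSIBLE_DIFF_SEC;
}
```

Logging every sample that fails a check like this, together with its raw
fields, would show whether a bad record coincides with the panic stop.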
These issues are so random it's hard to keep track. I had to restart
ntpd two days ago because it was spinning out of control (locked in some
kind of loop) with no obvious cause. A backtrace suggests a floating
point problem but I don't have that debugged yet. Prior to that it had
been almost a month since the last panic stop. I'm usually not around to
watch it happen; I just see that the system no longer responds to ntpq
queries from a remote system (I poll once every five seconds to watch
the billboard) and I go investigate.
The file I modified is ntp_refclock.c and the modified version is here